Rectangular Flows for Manifold Learning

by   Anthony L. Caterini, et al.
Layer 6 AI
Columbia University

Normalizing flows are invertible neural networks with tractable change-of-volume terms, which allows optimization of their parameters to be efficiently performed via maximum likelihood. However, data of interest is typically assumed to live in some (often unknown) low-dimensional manifold embedded in high-dimensional ambient space. The result is a modelling mismatch since – by construction – the invertibility requirement implies high-dimensional support of the learned distribution. Injective flows, mapping from low- to high-dimensional space, aim to fix this discrepancy by learning distributions on manifolds, but the resulting volume-change term becomes more challenging to evaluate. Current approaches either avoid computing this term entirely using various heuristics, or assume the manifold is known beforehand and therefore are not widely applicable. Instead, we propose two methods to tractably calculate the gradient of this term with respect to the parameters of the model, relying on careful use of automatic differentiation and techniques from numerical linear algebra. Both approaches perform end-to-end nonlinear manifold learning and density estimation for data projected onto this manifold. We study the trade-offs between our proposed methods, empirically verify that we outperform approaches ignoring the volume-change term by more accurately learning manifolds and the corresponding distributions on them, and show promising results on out-of-distribution detection.


page 1

page 2

page 3

page 4


Diagnosing and Fixing Manifold Overfitting in Deep Generative Models

Likelihood-based, or explicit, deep generative models use neural network...

Tractable Density Estimation on Learned Manifolds with Conformal Embedding Flows

Normalizing flows are generative models that provide tractable density e...

Joint Manifold Learning and Density Estimation Using Normalizing Flows

Based on the manifold hypothesis, real-world data often lie on a low-dim...

VQ-Flows: Vector Quantized Local Normalizing Flows

Normalizing flows provide an elegant approach to generative modeling tha...

Relative gradient optimization of the Jacobian term in unsupervised deep learning

Learning expressive probabilistic models correctly describing the data i...

Minimax rates for cost-sensitive learning on manifolds with approximate nearest neighbours

We study the approximate nearest neighbour method for cost-sensitive cla...

Nonlinear Isometric Manifold Learning for Injective Normalizing Flows

To model manifold data using normalizing flows, we propose to employ the...

1 Introduction

In recent years, Normalizing Flows (NFs) have become a staple of generative modelling, being widely used for density estimation (Dinh et al., 2014, 2017; Papamakarios et al., 2017; Kingma and Dhariwal, 2018; Durkan et al., 2019), variational inference (Rezende and Mohamed, 2015; Kingma et al., 2016), maximum entropy modelling (Loaiza-Ganem et al., 2017), and more (Papamakarios et al., 2019; Kobyzev et al., 2020). In density estimation, we typically have access to a set of points living in some high-dimensional space . NFs model the corresponding data-generating distribution as the pushforward of a simple distribution on – often a Gaussian – through a smooth bijective mapping. Clever construction of these bijections allows for tractable density evaluation and thus maximum likelihood estimation of the parameters. However, as an immediate consequence of this choice, the learned distribution has support homeomorphic to ; in particular, the resulting distribution is supported on a set of dimension

. This is not a realistic assumption in practice – especially for density estimation – as it directly contradicts the manifold hypothesis

(Bengio et al., 2013)

which states that high-dimensional data lives on a lower-dimensional manifold embedded in ambient space.

A natural idea to circumvent this misspecification is to consider injective

instead of bijective flows, which now push forward a random variable on

with to obtain a distribution on some -dimensional manifold embedded in . These mappings admit a change-of-variable formula bearing resemblance to that of bijective flows, but unfortunately the volume-change term becomes computationally prohibitive, which then impacts the tractability of maximum likelihood. While there have been recent efforts towards training flows where the resulting distribution is supported on a low-dimensional manifold (Gemici et al., 2016; Rezende et al., 2020; Brehmer and Cranmer, 2020; Kumar et al., 2020; Mathieu and Nickel, 2020; Cunningham et al., 2020), these approaches either assume that the manifold is known beforehand or propose various heuristics to avoid the change-of-variable computation. Both of these are undesirable, because, while we should expect most high-dimensional data of interest to exhibit low-dimensional structure, this structure is almost always unknown. On the other hand, we argue that avoiding the volume-change term may result in learning a manifold to which it is difficult to properly assign density, and this approach further results in methods which do not take advantage of density evaluation, undermining the main motivation for using NFs in the first place.

We show that density estimation for injective flows based on maximum likelihood can be made tractable. By carefully leveraging forward- and backward-mode automatic differentiation (Baydin et al., 2018)

, we propose two methods that allow backpropagating through the volume term arising from the injective change-of-variable formula. The first method involves exact evaluation of this term and its gradient which incurs a higher memory cost; the second uses conjugate gradients

(Nocedal and Wright, 2006) and Hutchinson’s trace estimator (Hutchinson, 1989) to obtain unbiased stochastic gradient estimates. Unlike previous work, our methods do not need the data manifold to be specified beforehand, and simultaneously estimate this manifold along with the distribution on it end-to-end, thus enabling maximum likelihood training to occur. To the best of our knowledge, ours are the first methods to scale backpropagation through the injective volume-change term to ambient dimensions close to

. We study the trade-off between memory and variance introduced by our methods and show empirical improvements over injective flow baselines for density estimation. We also show that injective flows obtain state-of-the-art performance for likelihood-based Out-of-Distribution (OoD) detection, assigning higher likelihoods to Fashion-MNIST (FMNIST)

(Xiao et al., 2017) than to MNIST (LeChun, 1998) with a model trained on the former.

2 Background

2.1 Square Normalizing Flows

A normalizing flow (Rezende and Mohamed, 2015; Dinh et al., 2017) is a diffeomorphism parametrized by , that is, a differentiable bijection with differentiable inverse. Starting with a random variable for a simple density supported on , e.g. a standard Gaussian, the change-of-variable formula states that the random variable has density on given by:


where is the differentiation operator, so that is the Jacobian of (with respect to the inputs and not ) evaluated at . We refer to this now standard setup as square flows since the Jacobian is a square matrix. The change-of-variable formula is often written in terms of the Jacobian of , but we use the form of (1) as it is more applicable for the next section. NFs are typically constructed in such a way that not only ensures bijectivity, but also so that the Jacobian determinant in (1) can be efficiently evaluated. When provided with a dataset , an NF models its generating distribution as the pushforward of through , and thus the parameters can be estimated via maximum likelhood as .

2.2 Rectangular Normalizing Flows

As previously mentioned, square NFs unrealistically result in the learned density having -dimensional support. We follow the injective flow construction of Brehmer and Cranmer (2020), where a smooth and injective mapping with is constructed. In this setting, is the low-dimensional variable used to model the data as . A well-known result from differential geometry (Krantz and Parks, 2008) provides an applicable change-of-variable formula:


where . The Jacobian-transpose-Jacobian determinant now characterizes the change in volume from to . We make several relevant observations: The Jacobian matrix is no longer a square matrix, and we thus refer to these flows as rectangular. Note that is only properly defined on and not , and is now supported on the -dimensional manifold . We write the indicator explicitly to highlight the fact that this density is not a density with respect to the Lebesgue measure; rather, the dominating measure is the Riemannian measure on the manifold (Pennec, 2006). One can clearly verify as a sanity check that when , equation (2) reduces to (1).

Since data points will almost surely not lie exactly on , we use a left inverse in place of such that for all , which exists because is injective. This is properly defined on , unlike which only exists over . Equation (2) then becomes:


Note that (3) is equivalent to projecting onto as , and then evaluating the density from (2) at the projected point.

Now, is injectively constructed as follows:


where and are both square flows, , and and are defined as and , where . Now, depends only on and not , so we write it as from now on. Applying (3) yields:


where and . We include a derivation of (5) in Appendix A, along with a note on why injective transformations cannot be stacked as naturally as bijective ones.

Evaluating likelihoods is seemingly intractable since constructing flows with a closed-form volume-change term is significantly more challenging than in the square case, even if the relevant matrix is now instead of . Brehmer and Cranmer (2020) thus propose a two-step training procedure to promote tractability wherein and are trained separately. After observing that there is no term encouraging , and that , they decide to simply train by minimizing the reconstruction error to encourage the observed data to lie on :


Note that the above requires computing both and , so that should be chosen as a flow allowing fast evaluation of both and . Architectures such as the Real NVP (Dinh et al., 2017) or follow-up work (Kingma and Dhariwal, 2018; Durkan et al., 2019) are thus natural choices for , while architectures with an autoregressive component (Papamakarios et al., 2017; Kingma et al., 2016) should be avoided. Then, since does not appear in the challenging determinant term in (5), can be chosen as any normalizing flow, and optimization – for a fixed – can be tractably achieved by maximum likelihood over the lower-dimensional space:


In practice, gradients steps in and are alternated. This entire procedure circumvents evaluation of the Jacobian-transpose-Jacobian determinant term in (5), but as we show in section 3, avoiding this term by separately learning the manifold and the density on it comes with its downsides. We then show how to tractably estimate this term in section 4.

3 Related Work and Motivation

Low-dimensional and topological pathologies

The mismatch between the dimension of the modelled support and that of the data-generating distribution has been observed throughout the literature in different ways. Dai and Wipf (2019)

show, in the context of variational autoencoders

(Kingma and Welling, 2014), that using flexible distributional approximators supported on to model data living in a low-dimensional manifold results in pathological behavior where the manifold itself is learned, but not the distribution on it. Cornish et al. (2020) demonstrate the drawbacks of using normalizing flows for estimating the density of topologically-complex data, and provide a new method learning NFs with supports which are not homeomorphic to , but they still model the support as being -dimensional. Behrmann et al. (2021) show numerical instabilities associated with NFs – particularly a lack of numerical invertibility, as also explained theoretically by Cornish et al. (2020). This is not too surprising, as attempting to learn a smooth invertible function mapping to some low-dimensional manifold is an intrinsically ill-posed problem. This body of work strongly motivates the development of models whose support has matching topology – including dimension – to that of the true data distribution.

Manifold flows

A challenge to overcome for obtaining NFs on manifolds is the Jacobian-transpose-Jacobian determinant computation. Current approaches for NFs on manifolds approach this challenge in one of two ways. The first assumes the manifold is known beforehand (Gemici et al., 2016; Rezende et al., 2020; Mathieu and Nickel, 2020), severely limiting its general applicability to low-dimensional data where the true manifold can realistically be known. The second group circumvents the computation of the Jacobian-transpose-Jacobian entirely through various heuristics. Kumar et al. (2020) use a potentially loose lower bound of the log-likelihood, and do not explicitly enforce injectivity, resulting in a method for which the change-of-variable almost surely does not hold. Cunningham et al. (2020) propose to convolve the manifold distribution with Gaussian noise, which results in the model having high-dimensional support. Finally, Brehmer and Cranmer (2020) propose the method we described in subsection 2.2, where manifold learning and density estimation are done separately in order to avoid the log determinant computation.

Why optimize the volume-change term?

Learning and separately without the Jacobian of is concerning: even if maps to the correct manifold, it might unnecessarily expand and contract volume in such a way that makes correctly learning much more difficult than it needs to be. Looking ahead to our experiments, Figure 1 exemplifies this issue: the top-middle panel shows the ground truth density on a 1-dimensional circle in , and the top-right panel the distribution recovered by the two-step method of Brehmer and Cranmer (2020). We can see that, while the manifold is correctly recovered, the distribution on it is not. The bottom-right panel shows the speed at which maps to : the top of the circle, which should have large densities, also has high speeds. Indeed, there is nothing in the objective discouraging to learn this behaviour, which implies that the corresponding low-dimensional distribution must be concentrated in a small region and thus making it harder to learn. The bottom-middle panel confirms this explanation: the learned low-dimensional distribution (dark red) does not match what it should (i.e. the distribution of , in light red). This failure could have been avoided by learning the manifold in a density-aware fashion by including the Jacobian-transpose-Jacobian determinant in the objective.

4 Maximum Likelihood for Rectangular Flows: Taming the Gradient

4.1 Our Optimization Objective

We have argued that including the Jacobian-transpose-Jacobian in the optimization objective is sensible. However, as we previously mentioned, (5) corresponds to the density of the projection of onto . Thus, simply optimizing the likelihood would not result in learning in such a way that observed data lies on it, only encouraging projected data points to have high likelihood. We thus maximize the log-likelihood subject to the constraint that the reconstruction error should be smaller than some threshold. In practice, we use the KKT conditions (Karush, 1939; Kuhn and Tucker, 1951) and maximize the Lagrangian:


where we treat

as a hyperparameter, and denote

as for simplicity. We have dropped the absolute value since is always symmetric positive definite, since has full rank by injectivity of . We now make a technical but relevant observation about our objective: since our likelihoods are Radon-Nikodym derivatives with respect to the Riemannian measure on , different values of will result in different dominating measures. One should thus be careful to compare likelihoods for models with different values of . However, thanks to the smoothness of the objective over , we should expect likelihoods for values of which are “close enough” to be comparable for practical purposes. In other words, comparisons remain reasonable locally, and the gradient of the volume-change term should still contain relevant information that helps learning in such a way that can easily learn a density on the pulled-back dataset .

4.2 Optimizing our Objective: Stochastic Gradients

Note that all the terms in (8) are straightforward to evaluate and backpropagate through except for the third one; in this section we show how to obtain unbiased stochastic estimates of its gradient. In what follows we drop the dependence of the Jacobian on from our notation and write , with the understanding that the end computation will be parallelized over a batch of

s. We assume access to an efficient matrix-vector product routine, i.e. computing

can be quickly achieved for any . We elaborate on how we obtain these matrix-vector products in the next section. It is a well known fact from matrix calculus (Petersen and Pedersen, 2008) that:


where denotes the trace operator and is the -th element of . Next, we can use Hutchinson’s trace estimator (Hutchinson, 1989), which states that for any matrix , for any -valued random variable with zero mean and identity covariance matrix. We can thus obtain an unbiased stochastic estimate of our gradient as:


where are typically sampled either from standard Gaussian or Rademacher distributions. Naïve computation of the above estimate remains intractable without explicitly constructing . Fortunately, the terms can be trivially obtained using the given matrix-vector product routine, avoiding the construction of , and then follows by taking the gradient w.r.t. .

There is however still the issue of computing . We use conjugate gradients (CG) (Nocedal and Wright, 2006) in order to achieve this. CG is an iterative method to solve problems of the form for given (in our case ) and ; we include the CG algorithm in Appendix B for completeness. CG has several important properties. First, it is known to recover the solution (assuming exact arithmetic) after at most steps, which means we can evaluate . The solution converges exponentially (in the number of iterations ) to the true value (Shewchuk et al., 1994), so often iterations are sufficient for accuracy to many decimal places. In practice, if we can tolerate a certain amount of bias, we can further increase computational speed by stopping iterations early. Second, CG only requires a method to compute matrix-vector products against , and does not require access to itself. One such product is performed at each iteration, and CG thus requires at most matrix-vector products, though again in practice products usually suffice. This results in solve complexity—less than the required by direct inversion methods. We denote computed with conjugate gradients as . We can then compute the estimator from (10) as:


In practice, we implement this term by noting that , thereby taking advantage of the stop_gradient operation from Automatic Differentiation (AD) libraries and allowing us to avoid implementing a custom backward pass. We thus compute the contribution of a point to the training objective as:


which gives the correct gradient estimate when taking the derivative with respect to .

Linear solvers for Jacobian terms

We note that linear solvers like CG have been used before to backpropagate through log determinant computations in the context of Gaussian processes (Gardner et al., 2018), and more recently for square NFs with flexible architectures which do not allow for straightforward Jacobian determinant computations (Huang et al., 2021; Lu et al., 2021). However, none of these methods require the Jacobian-transpose-Jacobian-vector product routine that we derive in the next section, and to the best of our knowledge, these techniques have not been previously applied for training rectangular NFs. We also point out that recently Oktay et al. (2021) proposed a method to efficiently obtain stochastic estimates of . While their method cannot be used as a drop-in replacement within our framework as it would result in a biased CG output, we believe this could be an interesting direction for future work. Finally, we note that CG has recently been combined with the Russian roulette estimator (Kahn, 1955) to avoid having to always iterate times while maintaining unbiasedness, again in the context of Gaussian processes (Potapczynski et al., 2021). We also leave the exploration of this estimator within our method for future work.

4.3 AD Considerations: The Exact Method and the Forward-Backward AD Trick

In this section we derive the aforementioned routine for vector products against , as well as an exact method that avoids the need for stochastic gradients (for a given ) at the price of increased memory requirements. But first, let us ask: why are these methods needed in the first place? There is work using power series to obtain stochastic estimates of log determinants (Han et al., 2015; Chen et al., 2019)

, and one might consider using them in our setting. However, these series require knowledge of the singular values of

, to which we do not have access (constructing to obtain its singular values would defeat the purpose of using the power series in the first place), and we would thus not have a guarantee that the series are valid. Additionally, they have to be truncated and thus result in biased estimators, and using Russian roulette estimators to avoid bias (Chen et al., 2019) can result in infinite variance (Cornish et al., 2020). Finally, these series compute and backpropagate (w.r.t. ) through products of the form for different values of , which can easily require more matrix-vector products than our methods.

Having motivated our approach, we now use commonly-known properties of AD to derive it; we briefly review these properties in Appendix C, referring the reader to Baydin et al. (2018) for more detail. First, we consider the problem of explicitly constructing . This construction can then be used to evaluate and exactly compute its log determinant either for log density evaluation of a trained model, or to backpropagate (with respect to ) through both the log determinant computation and the matrix construction, thus avoiding having to use stochastic gradients as in the previous section. We refer to this procedure as the exact method. Naïvely, one might try to explicitly construct using only backward-mode AD, which would require vector-Jacobian products (vjps) of the form – one per basis vector (and then stacking the resulting row vectors vertically). A better way to explicitly construct is with forward-mode AD, which only requires Jacobian-vector products (jvps) , again one per basis vector

(and then stacking the resulting column vectors horizontally). We use a custom implementation of forward-mode AD in the popular PyTorch

(Paszke et al., 2019) library111PyTorch has a forward-mode AD implementation which relies on the “double backward” trick, which is known to be memory-inefficient. See for a description. for the exact method, as well as for the forward-backward AD trick described below.

We now explain how to combine forward- and backward-mode AD to obtain efficient matrix-vector products against in order to obtain the tractable gradient estimates from the previous section. Note that can be computed with a single jvp call, and then can be efficiently computed using only a vjp call. We refer to this way of computing matrix-vector products against as the forward-backward AD trick. Note that (12) requires such matrix-vector products, which is seemingly less efficient as it is potentially greater than the jvps required by the exact method. However, the stochastic method is much more memory-efficient than its exact counterpart when optimizing over : of the matrix-vector products needed to evaluate (12), only require gradients with respect to . Thus only jvps and vjps, along with their intermediate steps, must be stored in memory over a training step. In contrast, the exact method requires gradients (w.r.t. ) for every one of its jvp computations, which requires storing these computations along with their intermediate steps in memory.

Our proposed methods thus offer a memory vs. variance trade-off. Increasing in the stochastic method results in larger memory requirements which imply longer training times, as the batch size must be set to a smaller value. On the other hand, the larger the memory cost, the smaller the variance of the gradient. This still holds true for the exact method, which results in exact gradients, at the cost of increased memory requirements (as long as ; if is large enough the stochastic method should never be used over the exact one). Table 1 summarizes this trade-off.

Exact (naive) vjps vjps
Exact jvps jvps
Stochastic jvps vjps jvps vjps
Table 1: Number of jvps and vjps (with respect to inputs) needed for forward and backward passes (with respect to ), along with the corresponding variance of gradient entries.

5 Experiments

We now compare our methods against the two-step baseline of Brehmer and Cranmer (2020), and also study the memory vs. variance trade-off. We use the real NVP (Dinh et al., 2017)

architecture for all flows, except we do not use batch normalization

(Ioffe and Szegedy, 2015) as it causes issues with vjp computations. We point out that all comparisons remain fair, including a detailed explanation of this phenomenon in Appendix D, along with all experimental details in Appendix F. Throughout, we use the labels RNFs-ML for our maximum likelihood training method, RNFs-TS for the two-step method, and RNFs for rectangular NFs in general. For most runs, we found it useful to anneal the likelihood term(s). That is, at the beginning of training we optimize only the reconstruction term, and then slowly incorporate the other terms. This likelihood annealing procedure helped avoid local optima where the manifold is not recovered (large reconstruction error) but the likelihood of projected data is high222Our code will be made available at upon publication..

5.1 Simulated Data

We consider a simulated dataset where we have access to ground truth, which allows us to empirically verify the deficiencies of RNFs-TS. We use a von Mises distribution, which is supported on the one-dimensional unit circle in . Figure 1 shows this distribution, along with its estimates from RNFs-ML (exact) and RNFs-TS. As previously observed, RNFs-TS correctly approximate the manifold, but fail to learn the right distribution on it. In contrast we can see that RNFs-ML, by virtue of including the Jacobian-transpose-Jacobian term in the optimization, manage to recover both the manifold and the distribution on it (top left panel), while also resulting in an easier-to-learn low-dimensional distribution (bottom middle panel) thanks to mapping to at a more consistent speed (bottom left panel). We do point out that, while the results presented here are representative of usual runs for both methods, we also had runs with different results which we include in Appendix F. We finish with the observation that even though the line and the circle are not homeomorphic and thus RNFs are not perfectly able to recover the support, they manage to adequately approximate it.

RNFs-ML (exact) density von Mises ground truth RNFs-TS density
RNFs-ML (exact) speed Distribution of RNFs-TS speed
Figure 1: Top row: RNFs-ML (exact) (left), von Mises ground truth (middle), and RNF-TS (right). Bottom row: Speed at which maps to (measured as distance between uniformly spaced consecutive points in mapped through ) for RNFs-ML (exact) (left), RNFs-TS (right), and distribution has to learn in order to recover the ground truth, fixing (middle). See text for discussion.

5.2 Tabular Data

We now turn our attention to the tabular datasets used by Papamakarios et al. (2017), now a common benchmark for NFs as well. As previously mentioned, one should be careful when comparing models with different supports, as we cannot rely on test likelihoods as a metric. We take inspiration from the FID score (Heusel et al., 2017)

, which is commonly used to evaluate quality of generated images when likelihoods are not available. The FID score compares the first and second moments of a well-chosen statistic from the model and data distributions using the squared Wasserstein-2 metric (between Gaussians). Instead of using the last hidden layer of a pre-trained classifier as is often done for images, we take the statistic to be the data itself: in other words, our metric compares the mean and covariance of generated data against those of observed data with the same squared Wasserstein-2 metric. We include the mathematical formulas for computing both FID and our modified version for tabular data in Appendix

E. We use early stopping with our FID-like score across all models. Our results are summarized in Table 2, where we can see that RNFs-ML consistently do a better job at recovering the underlying distribution. Once again, these results emphasize the benefits of including the Jacobian-transpose-Jacobian in the objective. Interestingly, except for HEPMASS, the results from our stochastic version with are not significantly exceeded by the exact version or using a larger value of , suggesting that the added variance does not result in decreased empirical performance. We highlight that no tuning was done (except on GAS for which we changed from to ), RNFs-ML outperforming RNFs-TS out-of-the-box here (details are in Appendix F). We report training times in Appendix F, and observe that RNFs-ML take a similar amount of time as RNFs-TS to train for datasets with lower values of

, and while we do take longer to train for the other datasets, our training times remain reasonable and we often require fewer epochs to converge.

RNFs-ML (exact)
RNFs-ML ()
RNFs-ML ()
Table 2: FID-like metric for tabular data (lower is better). Bolded runs are the best or overlap with it.

5.3 Image Data and Out-of-Distribution Detection

We also compare RNFs-ML to RNFs-TS for image modelling on MNIST and FMNIST. We point out that these datasets have ambient dimension , and being able to fit RNFs-ML is in itself noteworthy: to the best of our knowledge no previous method has scaled optimizing the Jacobian-transpose-Jacobian term to these dimensions. We use FID scores both for comparing models and for early stopping during training. We also used likelihood annealing and set , with all experimental details again given in Appendix F. We report FID scores in Table 3, where we can see that we outperform RNFs-TS. Our RNFs-ML variant also outperforms its decreased-variance counterparts. This is partially explained by the fact that we used this variant to tune RNFs-ML, but we also hypothesize that this added variance can be helpful because of the remaining (non-dimension-based) topological mismatch. Nonetheless, once again these results suggest that the variance induced by our stochastic method is not empirically harmful. We also report training times in Appendix F, where we can see the computational benefits of our stochastic method.

    Trained on FMNIST
Figure 2: OoD detection with RNFs-ML (exact).

We further evaluate the performance of RNFs for OoD detection. Nalisnick et al. (2019) pointed out that square NFs trained on FMNIST assign higher likelihoods to MNIST than they do to FMNIST. While there has been research attempting to fix this puzzling behaviour (Alemi et al., 2017, 2018; Choi et al., 2018; Ren et al., 2019), to the best of our knowledge no method has managed to correct it using only likelihoods of trained models. Figure 2 shows that RNFs remedy this phenomenon, and that models trained on FMNIST assign higher test likelihoods to FMNIST than to MNIST. This correction does not come at the cost of strange behaviour now emerging in the opposite direction (i.e. when training on MNIST, see Appendix F for a histogram). Table 3 quantifies these results (arrows point from in-distribution datasets to OoD ones) with the accuracy of a decision stump using only log-likelihood, and we can see that the best-performing RNFs models essentially solve this OoD task. While we leave a formal explanation of this result for future work, we believe this discovery highlights the importance of properly specifying models and of ensuring the use of appropriate inductive biases, in this case low intrinsic dimensionality of the observed data. We point out that this seems to be a property of RNFs, rather than of our ML training method, although our exact method is still used to compute these log-likelihoods at test time. We include additional results on OoD detection using reconstruction errors – along with a discussion – in Appendix F, where we found the opposite unexpected behaviour: FMNIST always has smaller reconstruction errors, regardless of which dataset was used for training.

RNFs-ML (exact)
RNFs-ML ()
RNFs-ML ()
Table 3: FID scores (lower is better) and decision stump OoD accuracy (higher is better).

6 Scope and Limitations

In this paper we address the dimensionality-based misspecification of square NFs while properly using maximum likelihood as the training objective. We thus provide an advancement in the training of RNFs. Our methods, like current likelihood-based deep generative modelling approaches, remain however topologically misspecified: even though we can better address dimensionality, we can currently only learn manifolds homeomorphic to . For example, one could conceive of the MNIST manifold as consisting of connected components (one per digit), which cannot be learned by . We observed during training in image data that the residuals of CG were not close to numerically, even after steps, indicating numerical non-invertibility of the matrix . We hypothesize that this phenomenon is caused by topological mismatch, which we also conjecture affects us more than the baseline as our CG-obtained (or from the exact method) gradients might point in an inaccurate direction. We thus expect our methods in particular to benefit from improved research on making flows match the target topology, for example via continuous indexing (Cornish et al., 2020).

Additionally, while we have successfully scaled likelihood-based training of RNFs far beyond current capabilities, our methods – even the stochastic one – remain computationally expensive for even higher dimensions, and further computational gains remain an open problem. We trained a single RNFs-ML (

) model on the CIFAR-10 dataset

(Krizhevsky et al., 2009)

, but doing this was computationally intensive and tuning hyperparameters has remained challenging. In this preliminary experiment, we did not observe good OoD performance (nor with RNFs-TS) for the SVHN dataset

(Netzer et al., 2011), although anecdotally we may have at least improved on the situation outlined by Nalisnick et al. (2019).

7 Conclusions and Broader Impact

In this paper we argue for the importance of likelihood-based training of rectangular flows, and introduce two methods allowing to do so. We study the benefits of our methods, and empirically show that they are preferable to current alternatives. Given the methodological nature of our contributions, we do not foresee our work having any negative ethical implications or societal consequences.


We thank Brendan Ross, Jesse Cresswell, and Maksims Volkovs for useful comments and feedback. We would also like to thank Rob Cornish for the excellent CIFs codebase upon which our code is built, and Emile Mathieu for plotting suggestions.


  • Alemi et al. [2017] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. ICLR, 2017.
  • Alemi et al. [2018] A. A. Alemi, I. Fischer, and J. V. Dillon. Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906, 2018.
  • Baydin et al. [2018] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind.

    Automatic differentiation in machine learning: a survey.

    Journal of machine learning research, 18, 2018.
  • Behrmann et al. [2021] J. Behrmann, P. Vicol, K.-C. Wang, R. Grosse, and J.-H. Jacobsen. Understanding and mitigating exploding inverses in invertible neural networks. In

    International Conference on Artificial Intelligence and Statistics

    , pages 1792–1800. PMLR, 2021.
  • Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Brehmer and Cranmer [2020] J. Brehmer and K. Cranmer. Flows for simultaneous manifold learning and density estimation. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Chen et al. [2019] R. T. Q. Chen, J. Behrmann, D. K. Duvenaud, and J.-H. Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Choi et al. [2018] H. Choi, E. Jang, and A. A. Alemi. Waic, but why? generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.
  • Cornish et al. [2020] R. Cornish, A. Caterini, G. Deligiannidis, and A. Doucet. Relaxing bijectivity constraints with continuously indexed normalising flows. In International Conference on Machine Learning, pages 2133–2143. PMLR, 2020.
  • Cunningham et al. [2020] E. Cunningham, R. Zabounidis, A. Agrawal, I. Fiterau, and D. Sheldon. Normalizing flows across dimensions. arXiv preprint arXiv:2006.13070, 2020.
  • Dai and Wipf [2019] B. Dai and D. Wipf. Diagnosing and enhancing vae models. ICLR, 2019.
  • Dinh et al. [2014] L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh et al. [2017] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. ICLR, 2017.
  • Durkan et al. [2019] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Gardner et al. [2018] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. Advances in Neural Information Processing Systems, 31, 2018.
  • Gemici et al. [2016] M. C. Gemici, D. Rezende, and S. Mohamed. Normalizing flows on riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.
  • Han et al. [2015] I. Han, D. Malioutov, and J. Shin. Large-scale log-determinant computation through stochastic chebyshev expansions. In International Conference on Machine Learning, pages 908–917. PMLR, 2015.
  • Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Huang et al. [2021] C.-W. Huang, R. T. Chen, C. Tsirigotis, and A. Courville.

    Convex potential flows: Universal probability distributions with optimal transport and convex optimization.

    ICLR, 2021.
  • Hutchinson [1989] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
  • Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
  • Kahn [1955] H. Kahn. Use of different Monte Carlo sampling techniques. Rand Corporation, 1955.
  • Karush [1939] W. Karush. Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, 1939.
  • Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • Kingma and Dhariwal [2018] D. P. Kingma and P. Dhariwal. Glow: generative flow with invertible 1 1 convolutions. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
  • Kingma et al. [2016] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, volume 30, 2016.
  • Kobyzev et al. [2020] I. Kobyzev, S. Prince, and M. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • Krantz and Parks [2008] S. G. Krantz and H. R. Parks. Geometric integration theory. Springer Science & Business Media, 2008.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kuhn and Tucker [1951] H. W. Kuhn and A. Tucker. W., 1951," nonlinear programming,". In

    Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability(University of California Press, Berkeley, CA)

    , volume 481492, 1951.
  • Kumar et al. [2020] A. Kumar, B. Poole, and K. Murphy. Regularized autoencoders via relaxed injective probability flow. In International Conference on Artificial Intelligence and Statistics, pages 4292–4301. PMLR, 2020.
  • LeChun [1998] Y. LeChun. The mnist database of handwritten digits, 1998. URL http://yann. lecun. com/exdb/mnist, 1998.
  • Loaiza-Ganem et al. [2017] G. Loaiza-Ganem, Y. Gao, and J. P. Cunningham. Maximum entropy flow networks. ICLR, 2017.
  • Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. ICLR, 2019.
  • Lu et al. [2021] C. Lu, J. Chen, C. Li, Q. Wang, and J. Zhu. Implicit normalizing flows. ICLR, 2021.
  • Mathieu and Nickel [2020] E. Mathieu and M. Nickel. Riemannian continuous normalizing flows. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Nalisnick et al. [2019] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don’t know? ICLR, 2019.
  • Netzer et al. [2011] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Nocedal and Wright [2006] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • Oktay et al. [2021] D. Oktay, N. McGreivy, J. Aduol, A. Beatson, and R. P. Adams. Randomized automatic differentiation. ICLR, 2021.
  • Papamakarios et al. [2017] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Papamakarios et al. [2019] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala.

    Pytorch: An imperative style, high-performance deep learning library.

    In Advances in Neural Information Processing Systems 32, volume 32. 2019.
  • Pennec [2006] X. Pennec. Intrinsic statistics on riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006.
  • Petersen and Pedersen [2008] K. B. Petersen and M. S. Pedersen. The matrix cookbook, Oct. 2008. URL Version 20081110.
  • Potapczynski et al. [2021] A. Potapczynski, L. Wu, D. Biderman, G. Pleiss, and J. P. Cunningham. Bias-free scalable gaussian processes via randomized truncations. International Conference on Machine Learning, to appear, 2021.
  • Ren et al. [2019] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Rezende and Mohamed [2015] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
  • Rezende et al. [2020] D. J. Rezende, G. Papamakarios, S. Racaniere, M. Albergo, G. Kanwar, P. Shanahan, and K. Cranmer. Normalizing flows on tori and spheres. In International Conference on Machine Learning, pages 8083–8092. PMLR, 2020.
  • Shewchuk et al. [1994] J. R. Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
  • Szegedy et al. [2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , pages 1–9, 2015.
  • Xiao et al. [2017] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Appendix A Injective Change-of-Variable Formula and Stacking Injective Flows

We first derive (5) from (3

). By the chain rule, we have:


The Jacobian-transpose Jacobian term in (3) thus becomes:


where the second equality follows from the fact that , , and are all square matrices; and the third equality follows because determinants are invariant to transpositions. The observation that the three involved matrices are square is the reason behind why we can decompose the change-of-variable formula for as applying first the change-of-variable formula for , and then applying it for .

This property, unlike in the case of square flows, does not always hold. That is, the change-of-variable formula for a composition of injective transformations is not necessarily equivalent to applying the injective change-of-variable formula twice. To see this, consider the case where and are injective, where and let . Clearly is injective by construction, and thus the determinant from its change-of-variable formula at a point is given by:


where now and . Unlike the determinant from (14), this determinant cannot be easily decomposed into a product of determinants since the involved matrices are not all square. In particular, (15) need not match:


which would be the determinant terms from applying the change-of-variable formula twice. Note that this observation does not imply that a flow like could not be trained with our method, it simply implies that the term has to be considered as a whole, and not decomposed into separate terms. It is easy to verify that in general, only an initial -dimensional square flow can be separated from the overall Jacobian-transpose-Jacobian determinant.

Appendix B Conjugate Gradients

We outline the CG algorithm in Algorithm 1, whose output we write as in the main manuscript. Note that CG does not need access to , just a matrix-vector product routine against , . If is symmetric positive definite, then CG converges in at most steps, i.e. its output matches and the corresponding residual is , and CG uses thus at most calls to . This convergence holds mathematically, but can be violated numerically if is ill-conditioned, which is why the condition is added in the while loop.

Input : , function for matrix-vector products against

, tolerance
Output : 
// current solution
// current residual
while  do
end while
Algorithm 1 CG

Appendix C Automatic Differentiation

Here we summarize the relevant properties from forward- and backward-mode automatic differentiation (AD) which we use in the main manuscript. Let be the composition of smooth functions , i.e. . For example, in our setting this function could be , so that , and the rest of the functions could be coupling layers from a -dimensional square flow (or the functions whose compositions results in the coupling layers). By the chain rule, the Jacobian of is given by:


where for . Forward-mode AD computes products from right to left, and is thus efficient for computing jvp operations. Computing is thus obtained by performing matrix-vector multiplications, one against each of the Jacobians on the right hand side of (17). Backward-mode AD computes products from left to right, and would thus result in significantly more inefficient jvp evaluations involving matrix-matrix products, and a single matrix-vector product. Analogously, backward-mode AD computes vjps of the form efficiently, using vector-matrix products, while forward-mode AD would require matrix-matrix products and a single vector-matrix product.

Typically, the cost of evaluating a matrix-vector or vector-matrix product against (or ) is the same as computing from , i.e. the cost of evaluating (or the cost of evaluating in the case of ) [Baydin et al., 2018]. jvp and vjp computations thus not only have the same computational cost, but this cost is also equivalent to a forward pass, i.e. computing .

When computing , obtaining a jvp with forward-mode AD adds the same memory cost as another computation of since intermediate results do not have to be stored. That is, in order to compute , we only need to store and (which has to be stored anyway for computing ) in memory. On the other hand, computing a vjp with backward-mode AD has a higher memory cost: One has to first compute and store all the intermediate (along with ), since computing from requires having in memory.

Appendix D Batch Normalization

We now explain the issues that arise when combining batch normalization with vjps. These issues arise not only in our setting, but every time backward-mode AD has to be called to compute or approximate the gradient of the determinant term. We consider the case with a batch of size , and , as it exemplifies the issue and the notation becomes simpler. Consider applying (without batch normalization) to each element in the batch, which we denote with the batch function :


The Jacobian of clearly has a block-diagonal structure:


This structure implies that relevant computations such as vjps, jvps, and determinants parallelize over the batch:


In contrast, when using batch normalization, the resulting computation does not have a block-diagonal Jacobian, and thus this parallelism over the batch breaks down, in other words:


where the above signs should be interpreted as “not generally equal to” rather than always not equal to, as equalities could hold coincidentally in rare cases.

In square flow implementations, AD is never used to obtain any of these quantities, and the Jacobian log determinants are explicitly computed for each element in the batch. In other words, this batch dependence is ignored in square flows, both in the log determinant computation, and when backpropagating through it. Elaborating on this point, AD is only used to backpropagate (with respect to ) over this explicit computation. If AD was used on to construct the matrices and we then computed the corresponding log determinants, the results would not match with the explicitly computed log determinants: The latter would be equivalent to using batch normalization with a stop_gradient operation with respect to but not with respect to , while the former would use no stop_gradient whatsoever. Unfortunately, this partial stop_gradient operation only with respect to inputs but not parameters is not available in commonly used AD libraries. While our custom implementation of jvps can be easily “hard-coded” to have this behaviour, doing so for vjps would require significant modifications to PyTorch. We note that this is not a fundamental limitation and that these modifications could be done to obtain vjps that behave as expected with a low-level re-implementation of batch normalization, but these fall outside of the scope of our paper. Thus, in the interest of performing computations in a manner that remains consistent with what is commonly done for square flows and that allows fair comparisons of our exact and stochastic methods, we avoid using batch normalization.

Appendix E FID and FID-like Scores

For a given dataset and a set of samples generated by a model , along with a statistic , the empirical means and covariances are given by:


The FID score takes as the last hidden layer of a pretrained inception network [Szegedy et al., 2015], and evaluates generated sample quality by comparing generated moments against data moments. This comparison is done with the squared Wasserstein-2 distance between Gaussians with corresponding moments, which is given by:


which is if and only if the moments match. Our proposed FID-like score for tabular data is computed the exact same way, except no inception network is used. Instead, we simply take to be the identity, .

Appendix F Experimental Details

First we will comment on hyperparameters/architectural choices shared across experiments. The -dimensional square flow that we use, as mentioned in the main manuscript, is a RealNVP network [Dinh et al., 2017]. In all cases, we use the ADAM [Kingma and Ba, 2015] optimizer and train with early stopping against some validation criterion specified for each experiment separately and discussed further in each of the relevant subsections below. We use no weight decay. We also do not use batch normalization in any experiments for the reasons mentioned above in Appendix D. We use a standard Gaussian on dimensions as in all experiments.


We ran our two-dimensional experiments on a Lenovo T530 laptop with an Intel i5 processor, with negligible training time per epoch. We ran the tabular data experiments on a variety of NVIDIA GeForce GTX GPUs on a shared cluster: we had, at varying times, access to 1080, 1080 Ti, and 2080 Ti models, but never access to more than six cards in total at once. For the image experiments, we had access to a 32GB-configuration NVIDIA Tesla v100 GPU. We ran each of the tabular and image experiments on a single card at a time, except for the image experiments for the RNFs-ML (exact) and () models which we parallelized over four cards.

Table 4 includes training times for all of our experiments. Since we used FID-like and FID scores for ealy stopping, we include both per-epoch and total times. Per epoch times of RNFs-ML exclude epochs where the Jacobian-transpose-Jacobian log determinant is annealed with a weight, although we include time added from this portion of training into the total time cost. Note throughout this section we also consider one epoch of the two-step baseline procedure to be one full pass through the data training the likelihood term, and then one full pass through the data training the reconstruction term.

Dataset RNFs-ML (exact) RNFs-ML () RNFs-ML () RNFs-TS
Table 4: Training times in seconds, “” means for tabular data and for images.

f.1 Simulated Data

The data for this experiment is simulated from a von Mises distribution centred at projected onto a circle of radius . We randomly generate training data points and train with batch sizes of . We use points for validation, performing early stopping using the value of the full objective and halting training when we do not see any validation improvement for epochs. We create visualizations in Figure 1 and Figure 3 by taking grid points equally-spaced between and as the low-dimensional space, project these into higher dimensions by applying the flow