Flows for simultaneous manifold learning and density estimation

by Johann Brehmer, et al.
New York University

We introduce manifold-modeling flows (MFMFs), a new class of generative models that simultaneously learn the data manifold as well as a tractable probability density on that manifold. Combining aspects of normalizing flows, GANs, autoencoders, and energy-based models, they have the potential to represent data sets with a manifold structure more faithfully and provide handles on dimensionality reduction, denoising, and out-of-distribution detection. We argue why such models should not be trained by maximum likelihood alone and present a new training algorithm that separates manifold and density updates. With two pedagogical examples we demonstrate how manifold-modeling flows let us learn the data manifold and allow for better inference than standard flows in the ambient data space.






1 Introduction

Inferring a probabilistic model from some example data points is a common problem that is increasingly often tackled with deep generative models. Generative adversarial networks (GANs) (1) and variational autoencoders (VAEs) (2) are both based on a lower-dimensional latent space and a learnable mapping from that to the data space. In essence, these models describe a lower-dimensional data manifold embedded in the data space. While they allow for efficient sampling, their probability density (or likelihood) is intractable, leading to a challenge for training and limiting their usefulness for inference tasks. On the other hand, normalizing flows (3, 4, 5) are based on a latent space with the same dimensionality as the data space and a diffeomorphism; their tractable density permeates through the full data space and is not restricted to a lower-dimensional surface.

Figure 1: Sketch of how a standard normalizing flow in the ambient data space (left, orange surface) and a manifold-modeling flow (right, purple) model data (black dots).

The flow approach may be unsuited to data points that do not populate the full feature space they are parameterized in, but are restricted to a lower-dimensional data manifold. Normalizing flows are by construction not able to represent such a data structure exactly; instead, they learn a smeared-out version with support off the data manifold. We illustrate this in the left panel of Figure 1, where the black dots represent 2D data populating a 1D manifold and the orange surface sketches the density learned by a normalizing flow. In addition, the requirement of latent spaces with the same dimension as the data space increases the memory footprint and computational cost of the model. While flows have been generalized from Euclidean feature spaces to Riemannian manifolds (6), this approach has so far been limited to the case where the chart for the manifold is prescribed.

Here we introduce manifold-modeling flows (MFMF): normalizing flows based on an injective, invertible map from a lower-dimensional latent space to the data space. MFMFs simultaneously learn the shape of the data manifold, provide a tractable bijective chart, and learn a probability density over the manifold, as sketched in the right panel of Figure 1. When evaluating the model, the input (which may be off the manifold) is first projected onto the manifold and the model returns both the distance from the manifold as well as the density on the manifold after the projection.

The MFMF approach marries aspects of normalizing flows, GANs, and autoencoders. Compared to flows on prescribed manifolds, this approach relaxes the requirement of knowing a closed-form expression for the chart from latent variables to the data manifold and instead learns the manifold from data. In contrast to GANs and VAEs, it not only provides an exact tractable likelihood over the data manifold, but also a prescription for how to treat points off the manifold. In contrast to standard autoencoders, it is a probabilistic model with a generative mode and tractable density. Similar to invertible autoencoders (7), MFMFs ensure that for data points on the manifold the encoder and decoder are the inverse of each other. They can also be seen as regularized autoencoders (8, 9). Compared to standard flow-based generative models, MFMFs offer four advantages:

  • Manifold-modeling flows may more accurately approximate the true data distribution, avoiding probability mass off the data manifold. This in turn could lead to performance gains in inference and generative tasks.

  • The model architecture naturally allows one to model a conditional density that lives on a fixed manifold. This should improve data efficiency in such situations, as the manifold structure is ingrained in the architecture and does not need to be learned.

  • The lower-dimensional latent space reduces the complexity of the model, allowing us to use more expressive transformations or scale to higher-dimensional data spaces within a given computational budget.

  • The projection onto the data manifold provides dimensionality reduction and denoising capabilities. The distance to the manifold may also be useful to detect out-of-distribution samples.

The MFMF model embraces the idea of energy-based models (10, 11, 12) for dealing with off-the-manifold issues through a non-probabilistic distance measure, while retaining a tractable density on the data manifold. Similarly, we can link it to the development of adversarial objectives for GANs: the original GAN setup (1), in which a generator is pitted against a discriminator, corresponds to training based on a proxy for the likelihood ratio. Off the data manifold this density ratio is not well-defined, which makes the training challenging. Wasserstein GANs (13) address this issue by measuring distances between two data manifolds in feature space. Similarly, the likelihood of normalizing flows is not appropriate when data populates a lower-dimensional manifold; the MFMF model augments flows with a distance measure in feature space to measure closeness to the data manifold.

Training an MFMF model faces two challenges. First, maximum likelihood is not enough: we will demonstrate that the training dynamics for naive likelihood-based training may not lead to a good estimate of the manifold and the density on it. Second, evaluating the MFMF density can be computationally expensive. We will discuss several new training strategies that solve these challenges. In particular, we introduce a new training scheme with separate manifold and density updates, which allows for a computationally efficient training of MFMF models and incentivizes both good manifold quality and good density estimation on the manifold.

We begin with a broad discussion of the notion of data manifolds in different generative models and introduce manifold-modeling flows in Section 2. In Section 3 we discuss pitfalls when training MFMF models and introduce training strategies that can overcome these challenges. In Section 4 we demonstrate MFMF in some experiments. We comment on related work in Section 5 before summarizing the results in Section 6. The code used in our study is available at http://github.com/johannbrehmer/manifold-flow.

2 Generative models and the data manifold

Consider a true data-generating process that draws samples $x \sim p^*(x)$, where the support of $p^*$ is a $d$-dimensional Riemannian manifold $\mathcal{M}^*$ embedded in the $D$-dimensional data space $X = \mathbb{R}^D$, with $d < D$. We consider the two problems of estimating the density $p^*(x)$ as well as the manifold $\mathcal{M}^*$, given some training samples $\{x_i\} \sim p^*(x)$. We will later extend our models with a projection to the manifold so that they can also handle problems in which the data is only approximately restricted to a manifold.

In the following we will discuss which types of generative models address which parts of this problem and generally discuss the relation between the data manifold and various classes of generative models. In the process we will also introduce the new manifold-modeling flows (MFMF). We distinguish between three different classes of models:

  1. manifold-free models defined in the ambient space $X$,

  2. models for an explicitly prescribed manifold, and

  3. models that learn an unknown manifold.

In this discussion we rely on a few simplifying assumptions. We treat the manifold as topologically equivalent to $\mathbb{R}^d$; in particular, we assume that it is connected and can be described by a single chart. We also assume that the dimensionality $d$ of the manifold is known. In Section 2.4 we will discuss how these requirements can be lifted.

To facilitate a straightforward comparison, we will describe all generative models in terms of two vectors of latent variables $\tilde{u} \in \tilde{U} = \mathbb{R}^d$ and $\bar{u} \in \bar{U} = \mathbb{R}^{D-d}$: $\tilde{U}$ is the latent space that maps to the learned manifold $\mathcal{M}$, i. e. the coordinates on the manifold, while $\bar{u}$ parameterizes any remaining latent variables, representing the directions “off the manifold”.

In Figure 2 we sketch the setup of the different models. In Table 1 we summarize some of their properties.

2.1 Manifold-free models

Ambient flow (AF).

In these conventions, a standard Euclidean normalizing flow (5) in the ambient data space is a diffeomorphism

$$f : U = \mathbb{R}^D \to X = \mathbb{R}^D, \quad u = (\tilde{u}, \bar{u}) \mapsto x = f(u), \qquad (1)$$

together with a tractable base density $p_u(u)$ (such as a multivariate unit Gaussian). According to the change-of-variables formula, the density in $X$ is then given by

$$p_x(x) = p_u\bigl(f^{-1}(x)\bigr) \, \bigl|\det J_f\bigl(f^{-1}(x)\bigr)\bigr|^{-1}, \qquad (2)$$

where $J_f$ is the Jacobian of $f$, a $D \times D$ matrix.

$f$ is usually implemented as a neural network with certain constraints that make the Jacobian determinant in (2) efficient to compute. In the generative mode, flows sample $\tilde{u}$ and $\bar{u}$ from the base density and apply the transformation $f$, leading to samples $x = f(u)$.

There is typically no difference between $\tilde{u}$ and $\bar{u}$. While some models employ multi-scale architectures where some latent variables have more transformations applied to them than others (4), there is no explicit incentive for the network to align these directions in the latent space with coordinates on the data manifold and off-the-manifold directions, respectively. This model therefore has no notion of a data manifold; it only describes regions of varying probability density in the overall ambient feature space. We will therefore refer to it as an ambient flow (AF).
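To make the change-of-variables mechanics in (2) concrete, here is a minimal numerical sketch, assuming a fixed invertible affine map in place of a learned flow $f$; the matrix `A` and offset `b` are arbitrary illustrative values.

```python
import numpy as np

# Toy ambient "flow": a fixed invertible affine map f(u) = A u + b in D = 2.
# A learned flow would parameterize this transformation with neural networks.
A = np.array([[2.0, 0.3], [0.0, 0.5]])
b = np.array([1.0, -1.0])

def log_prob(x):
    # Change of variables: log p_x(x) = log p_u(f^{-1}(x)) - log |det J_f|
    u = np.linalg.solve(A, x - b)
    log_base = -0.5 * u @ u - np.log(2 * np.pi)  # standard 2D Gaussian base
    return log_base - np.log(abs(np.linalg.det(A)))

def sample(rng):
    # Generative mode: draw u from the base density and apply f
    return A @ rng.normal(size=2) + b
```

The density covers the full 2D space; nothing in this construction singles out a lower-dimensional manifold.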

2.2 Prescribed manifold

Flow on a manifold (FOM).

When a chart (or an atlas of multiple charts) for the manifold is known a priori, one can construct a flow on this manifold (6). If a diffeomorphism

$$g : \tilde{U} = \mathbb{R}^d \to \mathcal{M} \subset X, \quad \tilde{u} \mapsto x = g(\tilde{u}), \qquad (3)$$

is the sole chart for the manifold, the density on the manifold is given by

$$p_\mathcal{M}(x) = p_{\tilde{u}}\bigl(g^{-1}(x)\bigr) \, \Bigl|\det\Bigl[J_g^T\bigl(g^{-1}(x)\bigr) \, J_g\bigl(g^{-1}(x)\bigr)\Bigr]\Bigr|^{-1/2}, \qquad (4)$$

where $J_g$ is the Jacobian of $g$, a $D \times d$ matrix. The latent variables $\tilde{u}$ are the coordinates on the manifold. The density $p_{\tilde{u}}(\tilde{u})$ in this coordinate space can then be modeled with a regular normalizing flow in $d$ dimensions, i. e. a learnable diffeomorphic transformation

$$h : \mathbb{R}^d \to \tilde{U}, \quad \tilde{v} \mapsto \tilde{u} = h(\tilde{v}), \qquad (5)$$

and a base density $p_{\tilde{v}}(\tilde{v})$. Then

$$p_{\tilde{u}}(\tilde{u}) = p_{\tilde{v}}\bigl(h^{-1}(\tilde{u})\bigr) \, \bigl|\det J_h\bigl(h^{-1}(\tilde{u})\bigr)\bigr|^{-1}, \qquad (6)$$

where $J_h$ is the Jacobian of $h$.

Sampling from such a flow is straightforward: one draws $\tilde{v}$ from the base density and transforms it with $h$ and then $g$. Depending on the choice of chart, the model likelihood in (4) can be evaluated efficiently, and the model is by construction limited to the true manifold. This approach has been worked out for spheres and tori of arbitrary dimension (14), for hyperbolic manifolds (15), as well as for a problem in theoretical physics where the manifold consists of a particular product of groups (16).
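As a small numerical sketch of the chart-based density in (4), assume a hypothetical one-dimensional manifold (an ellipse) embedded in two dimensions; the chart Jacobian enters through the $|\det[J_g^T J_g]|^{-1/2}$ factor.

```python
import numpy as np

# Prescribed chart for a 1D manifold (an ellipse) embedded in D = 2 dimensions
def g(u):
    return np.array([np.cos(u), 2.0 * np.sin(u)])

def jac_g(u):
    # D x d Jacobian of the chart, here a 2 x 1 matrix
    return np.array([[-np.sin(u)], [2.0 * np.cos(u)]])

def log_prob_on_manifold(u, log_p_u):
    # Density on the manifold: p_u(u) |det(J^T J)|^{-1/2}, in log space
    J = jac_g(u)
    return log_p_u(u) - 0.5 * np.log(np.linalg.det(J.T @ J))
```

The correction factor accounts for how the chart stretches the latent coordinate: where the ellipse is traversed quickly (near $u = 0$), probability mass is spread out and the density is reduced accordingly.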

2.3 Learning the manifold

Figure 2: Schematic relation between data and various latent variables in the different generative models discussed in Section 2. Red arrows represent learnable transformations, while black arrows stand for fixed transformations. Solid lines show invertible bijections, dashed lines denote injections that are invertible within their image, and dotted lines show unrestricted transformations that may be neither injective nor invertible.

Generative adversarial network (GAN).

GANs map a $d$-dimensional latent space to the data space,

$$g : \tilde{U} = \mathbb{R}^d \to X, \quad \tilde{u} \mapsto x = g(\tilde{u}). \qquad (7)$$

Here $g$ is a learnable map such as a deep neural network rather than a prescribed closed-form chart. This map is restricted neither to be injective nor to be invertible: there can be multiple $\tilde{u}$ that correspond to the same data point $x$. Therefore $g$ is not a chart and the image of this transformation is not necessarily a Riemannian manifold, though in practice this distinction is not relevant and we will simply call this subset a manifold.

While the lack of restrictions on $g$ increases the expressivity of the neural network, it also makes the model density intractable. This drawback has two immediate consequences. First, GANs have to be trained adversarially as opposed to by maximum likelihood. Second, despite their built-in manifold-like structure, GANs are neither well suited for inference tasks that require evaluating the model density nor for manifold learning tasks. (Reference (17) introduces a method that allows one to calculate the GAN density at least approximately, though this approach neglects the possibility of multiple $\tilde{u}$ pointing to the same $x$. PresGANs (18) add a noise term to the generative procedure, similar to a VAE, as well as a numerical method to evaluate the model density approximately using importance sampling.) Finally, note that in conditional GANs both the shape of the manifold and the implicit density on it generally depend on the variables being conditioned on. (To fix the manifold but let the density on it be conditional, one could make $g$ independent of the conditioning variables and model $p(\tilde{u})$ with a conditional density estimator such as a normalizing flow. Such a partially conditional GAN setup has, to the best of our knowledge, not yet been explored in the literature.)

Variational autoencoder (VAE).

Variational autoencoders also map a lower-dimensional latent space to the data space, but instead of a deterministic function they use a stochastic decoder $p(x \mid \tilde{u})$. The marginal density

$$p_x(x) = \int \mathrm{d}\tilde{u} \; p(x \mid \tilde{u}) \, p(\tilde{u}) \qquad (8)$$

of the model therefore extends off the manifold into the whole data space $X$. This marginal density itself is intractable, though there is a variational lower bound (the ELBO) that is commonly used as a training objective.

Nevertheless, the lower-dimensional latent space of a VAE is often associated with a learned data manifold. Often only the final step in the decoder is stochastic, for instance a Gaussian density in data space whose mean is a learned function of the latent variables. One can then define an alternative generative mode that uses this mean instead of sampling from the Gaussian, replacing the stochastic decoder with a deterministic one. In this way the generated samples are restricted to a lower-dimensional subset $\mathcal{M}$. While not strictly a manifold, for all practical purposes it is equivalent to one. However, generating in this mode does not correspond to sampling from the marginal density $p_x(x)$ that was used to train the model.

Model | Manifold | Chart | Generative mode | Tractable density | Restricted to manifold
Ambient flow (AF) | no manifold | — | ✓ | ✓ | ✗
Flow on manifold (FOM) | prescribed | ✓ | ✓ | ✓ | ✓
Generative adversarial network (GAN) | learned | ✗ | ✓ | ✗ | ✓
Variational autoencoder (VAE) | learned | ✗ | ✓ | only ELBO | (✓)
Pseudo-invertible encoder (PIE) | learned | ✓ | ✓ | ✓ | (✓)
Slice of PIE | learned | ✓ | ✗ | up to normalization | ✓
Manifold-modeling flow (MFMF) | learned | ✓ | ✓ | ✓ (may be slow) | ✓
Manifold-modeling flow with sep. encoder (MFMFE) | learned | ✓ | ✓ | ✓ (may be slow) | ✓

Table 1: Generative models for data that populate a lower-dimensional manifold. We differentiate the models by whether they have a prescribed or learned internal notion of the data manifold, whether they provide access to a diffeomorphic chart of that manifold, whether they allow us to generate samples, whether they have a tractable density, and whether the model density is actually restricted to the manifold (as opposed to the full feature space). In the last column, parentheses (✓) mean that an alternative sampling procedure can generate data just from the manifold, but that this sampling process does not follow the model density.

Pseudo-Invertible Encoder (PIE).

One way to give ambient flows a notion of a (learnable) manifold is to treat some of the latent variables differently from others and rely on the training to align one class of latent variables with the manifold coordinates and the other class with the off-the-manifold directions. This is the essential idea behind the pseudo-invertible encoder (PIE) architecture (19). Its basic setup is given by the flow transformation in (1) and the flow density in (2). The key difference is that PIE chooses different base densities for the latent variables $\tilde{u}$, which are designated to represent the coordinates on the manifold, and $\bar{u}$, which should learn the off-the-manifold directions in latent space. The base density $p_{\tilde{u}}(\tilde{u})$ is modeled with a $d$-dimensional Euclidean flow, i. e. a transformation $h$ that maps $\tilde{u}$ to another latent variable $\tilde{v}$ associated with a standard base density such as a unit Gaussian. The off-the-manifold base density $p_{\bar{u}}(\bar{u})$ is chosen such that it is sharply peaked around $\bar{u} = 0$, for instance a Gaussian with a small variance $\varepsilon^2$ in each direction.
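A rough numerical sketch of this choice of base densities, assuming for illustration a unit Gaussian over the manifold coordinates (standing in for the learned flow) and a narrow Gaussian with standard deviation `EPS` off the manifold:

```python
import numpy as np

d, D = 1, 2
EPS = 1e-2  # off-the-manifold standard deviation (a hyperparameter)

def pie_log_base_density(u):
    u_on, u_off = u[:d], u[d:]
    # Unit Gaussian over the manifold coordinates (a learned flow in PIE)
    log_on = -0.5 * u_on @ u_on - 0.5 * d * np.log(2 * np.pi)
    # Sharply peaked Gaussian over the off-the-manifold directions
    log_off = (-0.5 * u_off @ u_off / EPS**2
               - (D - d) * np.log(EPS * np.sqrt(2 * np.pi)))
    return log_on + log_off
```

Moving a small distance off the level set $\bar{u} = 0$ is penalized far more heavily than moving the same distance along it, which is what is supposed to drive that level set toward the data manifold.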

For sufficiently flexible transformations, this architecture has the same expressivity as an ambient flow, independent of the orientation of the latent space. In particular, a single scaling layer can learn to absorb the difference in base densities, allowing the flow to squeeze any region of data space into the narrow base density and thus fit the data equally well independent of how the latent variables $\tilde{u}$ and $\bar{u}$ are aligned with the data manifold. From that perspective, PIE does not seem to be a genuinely different model from AF. Yet in practice the learning dynamics and the inductive bias of the model seem to couple in a way that favors an alignment of the level set $\bar{u} = 0$ with the data manifold. Understanding these dynamics better would be an interesting research goal.

In many ways, PIE walks and quacks like an ambient flow. In particular, the model density in (2) generally has support over the full data space $X$, extending beyond the manifold. To sample from this density, one would still draw $\tilde{u} \sim p_{\tilde{u}}$ and $\bar{u} \sim p_{\bar{u}}$ and apply the transformation $x = f(u)$.

However, the labelling of different latent directions as manifold coordinates and off-the-manifold directions gives us some new handles. The authors of Reference (19) define a generative mode that samples data only from the learned manifold: one samples $\tilde{u}$ as usual, but fixes $\bar{u} = 0$, and then applies the transformation $x = f(\tilde{u}, 0)$. This is similar to sampling from the learned manifold of a VAE when the Gaussian mean is used as a deterministic decoder. If the inductive bias of the PIE model successfully leads to an alignment of $\tilde{u}$ with the manifold coordinates, this allows us to sample only from the manifold. Note, however, that the density defined by this sampling procedure is not the same as the tractable density $p_x(x)$: sampling with $\bar{u} = 0$ corresponds to the density in (14), not to the one in (2), and even when restricted to $\mathcal{M}$ these two densities need not be proportional to each other. To see this explicitly, we can write the Jacobian of $f$ in column notation as $J_f = (J_{\tilde{u}} \;\, J_{\bar{u}})$. Then for $x \in \mathcal{M}$ the sampling procedure follows

$$p_\mathcal{M}(x) = p_{\tilde{u}}(\tilde{u}) \, \bigl|\det\bigl[J_{\tilde{u}}^T J_{\tilde{u}}\bigr]\bigr|^{-1/2},$$

which is in general not proportional to

$$p_x(x) = p_{\tilde{u}}(\tilde{u}) \, p_{\bar{u}}(0) \, |\det J_f|^{-1}.$$

The discrepancy comes from $\det J_f$ containing additional factors that describe how the flow “squeezes” and “relaxes” the off-the-manifold latent variables around the manifold, while those terms do not play a role for $\det[J_{\tilde{u}}^T J_{\tilde{u}}]$. This is the case even if we restrict $f$ to volume-preserving flows, and the discrepancy survives in the limit $\varepsilon \to 0$. For a concrete example, think of standard 2D polar coordinates, where the radius $r$ plays the role of $\tilde{u}$ and the angle $\phi$ that of $\bar{u}$. Let the manifold be given by the line $\phi = 0$. Then $p_\mathcal{M}(x) = p_r(r)$, while $p_x(x) \propto p_r(r)/r$ on the manifold. Training a PIE model by maximizing the likelihood in (2) and then sampling from the manifold with $\bar{u} = 0$ is therefore inconsistent. Finally, note that the hyperparameter $\varepsilon$ allows us to smoothly interpolate between an ambient flow ($\varepsilon = 1$) and “manifolds” ($\varepsilon \to 0$).

Slice of PIE.

The PIE architecture defines a density $p_x(x)$ over the full data space, and the level set $\bar{u} = 0$ defines a manifold $\mathcal{M}$. It may therefore be tempting to study the density on $\mathcal{M}$ induced by restricting $p_x$ to the manifold, which is defined as

$$p_{\text{slice}}(x) = \frac{p_x(x)}{\int_\mathcal{M} \mathrm{d}x' \; p_x(x')}. \qquad (9)$$

While the normalizing integral in (9) cannot be computed efficiently, with (2) we can compute the numerator easily enough, so this likelihood is tractable up to an unknown normalizing constant. Depending on the task, this may or may not be sufficient.

The more pressing issue with this model is the generative mode. The density in (9) is not the same as the density defined by sampling data from the manifold, i. e. drawing $\tilde{u} \sim p_{\tilde{u}}$ and pushing it into data space with $x = f(\tilde{u}, 0)$. More importantly, we do not know how to sample from (9) efficiently.

Manifold-modeling flow (MFMF).

We now introduce the main new algorithm of this paper: the manifold-modeling flow (MFMF). It combines the learnable-manifold aspect of GANs with the tractable density of flows on manifolds (FOM), without introducing inconsistencies between the generative mode and the tractable likelihood. We begin by modeling the relation between the latent space and the data space with a diffeomorphism

$$f : \mathbb{R}^D \to X, \quad u = (\tilde{u}, \bar{u}) \mapsto x = f(u), \qquad (10)$$

just as for an ambient flow or PIE. We define the model manifold through the level set

$$\mathcal{M} = \{\, f(\tilde{u}, \bar{u}) : \bar{u} = 0 \,\}. \qquad (11)$$

In practice, we implement this transformation as a zero padding followed by a series of invertible transformations,

$$g = f \circ \mathrm{Pad} = f_k \circ \dots \circ f_1 \circ \mathrm{Pad}, \qquad (12)$$

where

$$\mathrm{Pad} : \mathbb{R}^d \to \mathbb{R}^D, \quad \tilde{u} \mapsto (\tilde{u}_1, \dots, \tilde{u}_d, 0, \dots, 0)^T \qquad (13)$$

denotes padding a $d$-dimensional vector with $D - d$ zeros, and the invertible transformations $f_i$ operate in $D$-dimensional space. Viewed as a map from the latent space $\tilde{U}$ to the data space $X$, the transformation $g$ is injective and (when restricted to its image $\mathcal{M}$) invertible.

Just as for FOM and PIE, we model the base density $p_{\tilde{u}}(\tilde{u})$ with a $d$-dimensional flow $h$, which maps $\tilde{u}$ to another latent variable $\tilde{v}$ with an associated tractable base density $p_{\tilde{v}}(\tilde{v})$. There is no need for a base density over the off-the-manifold variables $\bar{u}$ in this approach. The induced probability density on the manifold is then given by

$$p_\mathcal{M}(x) = p_{\tilde{u}}\bigl(g^{-1}(x)\bigr) \, \Bigl|\det\Bigl[J_g^T\bigl(g^{-1}(x)\bigr) \, J_g\bigl(g^{-1}(x)\bigr)\Bigr]\Bigr|^{-1/2}, \qquad (14)$$

where $J_g$ is the Jacobian of $g$, a $D \times d$ matrix. This is the same as (4), except with a learnable transformation $g$ rather than a prescribed, closed-form chart. This model density is defined only on the manifold and normalized over it, $\int_\mathcal{M} \mathrm{d}x \; p_\mathcal{M}(x) = 1$.

Sampling from an MFMF is straightforward: one draws $\tilde{v} \sim p_{\tilde{v}}(\tilde{v})$ and pushes the latent variable forward to the data space as $\tilde{u} = h(\tilde{v})$ followed by $x = g(\tilde{u})$, leading to data points on the manifold that consistently follow $x \sim p_\mathcal{M}(x)$.
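A linear toy version of this construction (a single invertible linear layer standing in for the learned transformations $f_i$; the matrix values are arbitrary) shows the zero padding, the density with its $J_g^T J_g$ factor, and the sampling mode together:

```python
import numpy as np

d, D = 1, 2
A = np.array([[1.0, 0.2], [0.4, 1.0]])  # one invertible linear layer as f

def g(u_tilde):
    # Zero-pad the d latent coordinates to D dimensions, then apply f
    u = np.concatenate([u_tilde, np.zeros(D - d)])
    return A @ u

def log_prob(u_tilde, log_p_u):
    # Jacobian of g = f ∘ Pad: here simply the first d columns of A (D x d)
    J = A[:, :d]
    return log_p_u(u_tilde) - 0.5 * np.log(np.linalg.det(J.T @ J))

def sample(rng):
    # Draw latent coordinates and push them onto the manifold
    return g(rng.normal(size=d))
```

Every sample lies exactly on the one-dimensional model manifold (the image of `g`), and the density is normalized over that manifold rather than over the ambient plane.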

Figure 3: Sketch of how an MFMF evaluates arbitrary points on or off the learned manifold. On the left side we show the data space with data samples (grey) and the embedded manifold (orange). On the right side the latent space is shown. In purple we sketch the evaluation of a data point including its transformation to the latent space, the projection onto the manifold coordinates, and the transformation back to the manifold.

As a final ingredient of the MFMF approach, we add a prescription for evaluating arbitrary points $x \in X$, which may be off the manifold. As we illustrate in Figure 3, $g$ maps from a low-dimensional latent space to the data space and is thus essentially a decoder. We define a matching encoder as $f^{-1}$ followed by a projection onto the $\tilde{u}$ component:

$$g^{-1} = \mathrm{Proj} \circ f^{-1}, \qquad (15)$$

with $\mathrm{Proj} : \mathbb{R}^D \to \mathbb{R}^d, \; u = (\tilde{u}, \bar{u}) \mapsto \tilde{u}$. This extends the inverse of $g$ (which is so far only defined for $x \in \mathcal{M}$) to the whole data space $X$. Similar to an autoencoder, combining $g^{-1}$ and $g$ allows us to calculate a reconstruction error

$$\bigl\| x - g\bigl(g^{-1}(x)\bigr) \bigr\|^2, \qquad (16)$$

which is zero if and only if $x \in \mathcal{M}$. Unlike for standard autoencoders, the encoder and decoder are exact inverses of each other for points on the manifold.

For an arbitrary $x \in X$, an MFMF thus lets us compute three quantities:

  • The projection onto the manifold, $x' = g(g^{-1}(x))$, which may be used as a denoised version of the input.

  • The reconstruction error $\|x - x'\|^2$, which will be important for training, but may also be useful for anomaly detection or out-of-distribution detection.

  • The likelihood on the manifold after the projection, $p_\mathcal{M}(x')$.

In this way, MFMFs separate the distance from the data manifold and the density on the manifold—two concepts that easily get conflated in an ambient flow. MFMFs embrace ideas of energy-based models for dealing with off-the-manifold issues, but still have a tractable, exact likelihood on the learned data manifold. Figure 3 summarizes how an MFMF model evaluates a data point by transforming to the latent space, projecting onto the manifold (where the density is evaluated), and transforming back to data space (where the reconstruction error is calculated).
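These quantities can be sketched with a linear toy decoder (a single invertible linear layer with arbitrary illustrative values standing in for the learned flow):

```python
import numpy as np

d, D = 1, 2
A = np.array([[1.0, 0.2], [0.4, 1.0]])  # invertible linear layer as f

def decode(u_tilde):
    # g: zero-pad the latent coordinates, then apply f
    return A @ np.concatenate([u_tilde, np.zeros(D - d)])

def encode(x):
    # Invert f, then project onto the first d latent coordinates
    return np.linalg.solve(A, x)[:d]

def reconstruction_error(x):
    return float(np.sum((x - decode(encode(x))) ** 2))

x_on = decode(np.array([0.7]))         # a point on the learned manifold
x_off = x_on + np.array([0.0, 0.5])    # the same point perturbed off it
```

Points on the manifold reconstruct exactly, while off-manifold points acquire a nonzero reconstruction error that can serve as a distance-to-manifold score.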

Manifold-modeling flows with separate encoder (MFMFE).

Finally, we introduce a variant of the MFMF model in which, instead of using the inverse $f^{-1}$ followed by a projection as an encoder, we encode the data with a separate function

$$e : X \to \tilde{U}, \quad x \mapsto \tilde{u} = e(x). \qquad (17)$$

This encoder is not restricted to be invertible or to have a tractable Jacobian, potentially increasing the expressiveness of the network. Just as in the MFMF approach, for a given data point $x$ the MFMFE model returns a projected point on the learned manifold, $x' = g(e(x))$, a reconstruction error $\|x - x'\|^2$, and the likelihood on the manifold evaluated after the projection, $p_\mathcal{M}(x')$.
The added expressivity of this encoder comes at the price of potential inconsistencies between encoder and decoder, which the training procedure will have to penalize, exactly as for a standard autoencoder and similar to VAEs.

2.4 Manifolds with unknown dimensionality or nontrivial topology

So far we have made two key assumptions to simplify the learning problem: that we know the manifold dimensionality $d$, and that the manifold is topologically equivalent to $\mathbb{R}^d$ (in particular, that it can be mapped by a single chart). The algorithms presented above can be extended to the more general case where these assumptions are relaxed.

If the dimension of the manifold is not known, a brute-force solution is to scan over values of $d$ and train a model for each value. A common metric for flow-based models is the model log likelihood evaluated on test samples, but that criterion is not admissible in this context, since the space the data live in (and hence the units of the likelihood) differs between values of $d$. However, we can compare models with different manifold dimensionality based on the reconstruction error, as well as on downstream tasks such as the quality of generated samples or the performance on inference tasks. A drop in performance is expected when the model manifold dimension becomes smaller than the true manifold dimension.
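As a rough illustration of such a scan, here is a linear stand-in (SVD-based subspace fitting instead of an actual MFMF) on synthetic data with a known two-dimensional manifold:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data on a 2-dimensional linear manifold embedded in D = 5 dimensions
basis = rng.normal(size=(5, 2))
data = rng.normal(size=(1000, 2)) @ basis.T

def recon_error(d):
    # Linear stand-in for an MFMF manifold: the best rank-d subspace via SVD
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = vt[:d]
    residual = centered - centered @ proj.T @ proj
    return float(np.mean(np.sum(residual ** 2, axis=1)))

errors = [recon_error(d) for d in range(1, 6)]
```

The reconstruction error drops sharply until the candidate dimension reaches the true manifold dimension and then plateaus, which is the signature one would look for in the scan.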

Alternatively, for the PIE algorithm one could use trainable values of the base density variance along each latent direction, with suitable regularization favoring values close to 0 or 1. In this way the model can learn the manifold dimensionality directly from the training data.

If the manifold consists of multiple disjoint pieces, potentially with different dimensionality, a mixture model with separate transformations from latent space to data space may work. It remains to be seen if such a model is easy to train. See Reference (14) for a discussion of such issues.

3 Efficient training and evaluation

Having defined the MFMF model, we now turn to the question of how to train it. Most flow-based generative models are trained by maximum likelihood, with architectures commonly designed with the goal of making the likelihood in (2) efficient to evaluate. For implicit generative models that is not available: GANs are trained adversarially, for instance pitted against a discriminator or using an optimal transport (OT) metric, while VAEs are commonly trained on a lower bound for the marginal likelihood (the ELBO). We will draw on all of these approaches, beginning with a discussion of two challenges of likelihood-based training for MFMFs in Section 3.1. We discuss a number of more promising training strategies in Section 3.2, before commenting on steps to also make the evaluation of the likelihood more efficient in Section 3.3.

3.1 Maximum likelihood is not enough

A subtlety in the naive interpretation of the density.

Since the MFMF model has a tractable density, maximum likelihood is an obvious candidate for a training objective. However, the situation is more subtle, as the MFMF model describes the density after projecting onto the learned manifold. The definition of the data variable in the likelihood hence depends on the weights $\theta_f$ of the manifold-defining transformation $f$, and a comparison of naive likelihood values between different configurations of $\theta_f$ is meaningless. Instead of thinking of a single likelihood function $p_\mathcal{M}(x \mid \theta_f, \theta_h)$, where $\theta_h$ are the weights of the transformation $h$, it is instructive to think of a family of likelihood functions parameterized by the different $\theta_f$.

Training MFMFs by simply maximizing the naive likelihood is therefore not meaningful, does not incentivize the network to learn the right shape of the manifold, and probably will not converge to the true model. As an extreme example, consider a model manifold that is perpendicular to the true data manifold. Since this configuration allows the MFMF to project all points to a region of very high density on the model manifold, this pathological configuration may lead to a very high naive likelihood value.

(a) Setup. The model manifold is a straight line in 2D Euclidean space that passes through the origin and is rotated with respect to the $x$-axis by an angle $\theta$. On this line, the density is a Gaussian with mean at the origin and standard deviation $\sigma$, which is a model parameter. The training data (black dots) are generated with $\theta^* = 0$ and $\sigma^* = 1$.
(b) Loss functions. Top left: naive log likelihood as a function of the model parameters $\theta$ and $\sigma$. When fixing the manifold to $\theta = 0$, the true value $\sigma = 1$ (black star) maximizes the naive likelihood. However, when varying both parameters, the likelihood can be larger for the pathological configuration $\theta = \pi/2$ and $\sigma \to 0$. Top right: reconstruction error when projecting to the model manifold, which is minimized by the true configuration. Bottom left: combined loss given by the reconstruction error minus a small factor times the naive log likelihood; the true configuration is a local minimum, but the global minimum at $\theta = \pi/2$ and $\sigma \to 0$ persists. Bottom right: log likelihood after subtracting the maximum log likelihood for each value of $\theta$.
Figure 4: Toy example showing that maximum naive likelihood is not a suitable training objective for manifold-modeling flows.

We demonstrate this issue with a simple toy problem in Figure 4. The feature space is two-dimensional; the model manifold consists of a line through the origin with variable angle $\theta$, such that $\theta = 0$ corresponds to a manifold aligned with the $x$-axis and $\theta = \pi/2$ to a manifold aligned with the $y$-axis. On this line we consider a one-dimensional Gaussian probability density with mean at the origin and standard deviation $\sigma$. Training samples are generated from $\theta^* = 0$ and $\sigma^* = 1$. The setup is sketched in Figure 4(a). In the top left panel of Figure 4(b) we show how the naive likelihood of this model over the training data depends on the parameters $\theta$ and $\sigma$. When fixing the manifold to the true value $\theta = 0$, the correct standard deviation $\sigma = 1$ indeed maximizes the naive likelihood. However, the model can achieve an even higher naive likelihood for $\theta = \pi/2$ and $\sigma \to 0$, representing a manifold that is orthogonal to the true one and projects all data points to a region of extremely high density on the manifold. In this limit the likelihood is in fact unbounded from above. Clearly, maximizing the naive likelihood alone is not very good at incentivizing the model to learn the correct manifold.
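This pathology is easy to reproduce numerically; a sketch of the toy setup (a line through the origin at angle theta, with a 1D Gaussian of standard deviation sigma on it):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=1000)
data = np.stack([t, np.zeros_like(t)], axis=1)  # true manifold: the x-axis

def naive_log_lik(theta, sigma, x):
    # Project onto the model manifold (line at angle theta), then evaluate
    # the 1D Gaussian density of the projected coordinate
    e = np.array([np.cos(theta), np.sin(theta)])
    s = x @ e
    return np.mean(-0.5 * (s / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

def recon_error(theta, x):
    e = np.array([np.cos(theta), np.sin(theta)])
    s = x @ e
    return np.mean(np.sum((x - np.outer(s, e)) ** 2, axis=1))

ll_true = naive_log_lik(0.0, 1.0, data)          # correct configuration
ll_path = naive_log_lik(np.pi / 2, 1e-3, data)   # orthogonal manifold, tiny sigma
```

The orthogonal configuration with a tiny sigma beats the true configuration on naive likelihood, while the reconstruction error correctly singles out the true manifold.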

To address this, we can add a second training objective that is responsible for learning the manifold. A suitable candidate is the reconstruction error discussed in the previous section. The top right panel of Figure 4(b) shows the mean reconstruction error as a function of the model parameters, which is indeed minimal for the true configuration.

One way to combine the two metrics is training on a combined loss that sums the reconstruction error and the negative naive log likelihood, with hyperparameters weighting the two terms. This helps, but does not really solve the problem. In our toy example we show such a combined loss in the bottom left panel of Figure 4(b). While the correct configuration is a local minimum of this loss, the pathological minimum still exists and leads to a lower (and unbounded from below) loss. In general the correct solution might not even be a local minimum of such a combined loss function. When training the model parameters by minimizing this combined loss, the gradient flow may take the model to the correct solution or to a pathological configuration, depending on the initialization and the choice of hyperparameters.

A better strategy is to separate the model parameters that define the manifold from those that only describe the density on it. In the MFMF setup of the previous section, the parameters of the manifold-defining transformation make up the first class, while the parameters of the density-defining transformation (or equivalently of the coordinate density) are in the second; in the toy example in Figure 4, the angle fixes the manifold and the standard deviation the density on it. We can then update the manifold parameters based only on the reconstruction error and the density weights based only on the log likelihood. In Figure 4(b) this corresponds to horizontal steps in the top right panel and vertical steps in the bottom right panel, where we show the log likelihood normalized to the maximum likelihood estimator (MLE) for each value of the angle. Such a training procedure is not prone to the gradient flow leading the model to a pathological configuration. In the limit of infinite capacity, sufficient training data, and successful optimization, it will correctly learn both the manifold and the density on it.

Evaluating the likelihood can be expensive.

The second challenge is the computational efficiency of evaluating the MFMF density in (14). While this quantity is tractable, it cannot be computed as cheaply as the ambient flow density of (2). The underlying reason is that the Jacobian is not square, so it is not obvious how its determinant can be decomposed further. In particular, when the map consists of multiple functions as given in (12), the overall Jacobian is the product of the individual Jacobians. While the Jacobians of the individual diffeomorphic transformations are invertible matrices, the Jacobian that represents the zero-padding is a rectangular matrix consisting of an identity matrix padded with zeros, leaving us with the determinant of the Gram matrix of this product to calculate.

This determinant can be computed explicitly. However, when we compose an MFMF out of invertible transformations that have been designed for standard flows—coupling layers with invertible elementwise transformations, autoregressive transformations, permutations, or invertible linear transformations—evaluating this MFMF density requires the computation of all entries of the Jacobians of the individual transformations. This is a much larger computational effort than in the case of standard flows, where the overall log determinant can be split into a sum over the log determinants of each layer, which in turn can usually be written down as a single number without having to compute all elements of a Jacobian first.

While the computational cost of evaluating (19) is often reasonable for the evaluation of a limited number of test samples, it can be prohibitively expensive during training, which typically requires many more evaluations. Since the computational cost grows with the data dimensionality, training by maximizing this likelihood does not scale to high-dimensional problems.
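At a single point, the extra factor in the injective change of variables can be evaluated via the Gram matrix of the rectangular Jacobian. A sketch with assumed dimensions, where a plain random matrix stands in for the Jacobian:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed dimensions: ambient dimension D, manifold dimension d < D.
# A random matrix stands in for the rectangular Jacobian at one point.
D, d = 8, 3
J = rng.normal(size=(D, d))

# The injective change of variables involves det(J^T J)^(1/2) rather
# than a plain determinant, since J is not square.
sign, logdet = np.linalg.slogdet(J.T @ J)
half_log_det = 0.5 * logdet

# Equivalent via singular values: 0.5 * log det(J^T J) = sum_i log s_i.
s = np.linalg.svd(J, compute_uv=False)
print(half_log_det, np.sum(np.log(s)))
```

Either route requires all entries of J, which is exactly the expense discussed above when J is a product of many layer Jacobians.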

Fortunately, gradient updates do not always require computing the full likelihood of the model. In particular, consider the training procedure introduced in the previous section, where we update the parameters of the manifold-defining transformation by minimizing the reconstruction error and update the parameters of the density-defining transformation (and thus of the coordinate density) by maximizing the log likelihood. The manifold update phase does not require computing the log likelihood at all. For the density update, the loss functional combines the negative log base density of the transformed coordinates, the log determinant of the density-defining transformation, and one half of the log determinant of the Gram matrix of the Jacobian of the manifold-defining transformation. However, the last term (which is slow to evaluate) does not depend on the parameters of the density-defining transformation and does not contribute to the gradient updates in this phase! We can therefore just as well train these parameters by minimizing only the first two terms, which can be evaluated very efficiently.

3.2 Training strategies

Simultaneous manifold and likelihood training (S).

For completeness we include the simultaneous optimization of the parameters of the manifold-defining transformation and the parameters of the density-defining transformation on a combined loss that sums the negative naive log likelihood and the reconstruction error, weighted relative to each other by a hyperparameter.

Following the discussion in Section 3.1, we do not expect this algorithm to perform very well. First, as demonstrated in the toy example in Figure 4, there is a risk of pathological models with poor manifold quality and poor density estimation for which this loss is very small, potentially even lower than for the true model. Which configuration the model ends up in may critically depend on the initialization and the learning dynamics. Second, evaluating this loss can be computationally expensive, especially for high-dimensional problems. Nevertheless, we include this algorithm in our experiments on low-dimensional data for comparison.

To ameliorate the potential instability of this training objective and to speed up the training, we add a pre-training and a post-training phase. In the pre-training phase, the model is trained by minimizing the reconstruction error only, hopefully pushing the weights of the manifold-defining transformation into the basin of attraction around the true model configuration before the main training phase begins. In the post-training phase, the parameters of the manifold-defining transformation are fixed and only the parameters of the density-defining transformation are updated by minimizing only the relevant terms in the loss.

Separate manifold and density training (M/D).

As discussed above, we expect both faster and more robust training when separating manifold and density updates, splitting the training into two phases:

Manifold phase:

Update only the parameters of the manifold-defining transformation (and thus of the manifold itself, which is defined as a level set of this transformation) by minimizing the mean squared reconstruction error between the training samples in a batch and their projections onto the manifold. For the MFMFE model, the parameters of the encoder are also updated during this phase.

Density phase:

Update only the parameters of the density-defining transformation (which define the coordinate density) by minimizing the negative log likelihood of the projected training samples, dropping the parameter-independent Jacobian term as discussed in Section 3.1.
An important choice is how these two phases are scheduled. The most straightforward strategy is a sequential training, in which the manifold-defining transformation is learned first, followed by the density-defining transformation . We also experiment with an alternating scheme, where we switch between the two phases after a fixed number of gradient updates. The algorithm is described in more detail in Algorithm 1.

Input: the learning rate; factors weighting the terms in the loss functions; the batch sizes and the number of batches per training phase; initial weights of the manifold-defining transformation f and the density-defining transformation h.
while the weights of f or of h have not converged do
     for each manifold-phase batch do ▷ Manifold phase
          Sample a mini-batch of training data
          for each sample in the batch do
               Transform to the latent space with the inverse of f
               Project to the manifold and transform back to data space
          Compute the reconstruction error of the batch
          Update the manifold weights of f with a gradient step
     for each density-phase batch do ▷ Density phase
          Sample a mini-batch of training data
          for each sample in the batch do
               Transform to the latent space with the inverse of f
               Project to the manifold and transform to the coordinate base space with the inverse of h
          Compute the log likelihood of the batch
          Update the density weights of h with a gradient step
Algorithm 1

Alternating manifold / density (M/D) training for manifold-modeling flows. Instead of alternating between manifold and density phases as shown here, one can employ a sequential version in which the manifold is first trained until convergence, followed by a training phase focused on the density. For simplicity we show a version based on stochastic gradient descent with a constant learning rate, though the algorithm can trivially be extended to other optimizers and learning rate schedules (which may be different for the two phases).
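To make the schedule concrete, the following applies alternating M/D updates to the line-manifold toy example of Section 3.1, with closed-form gradients standing in for backpropagation through neural networks; the data, initialization, and learning rate are illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Alternating M/D updates for the toy example: the model manifold is a
# line at angle theta through the origin, the density on it a Gaussian
# N(0, sigma). The data lie on the x-axis.
x = np.stack([rng.normal(0.0, 1.0, 5000), np.zeros(5000)], axis=1)

theta, sigma, lr = 0.8, 2.0, 0.05  # deliberately wrong starting point

for step in range(400):
    e = np.array([np.cos(theta), np.sin(theta)])    # unit vector along the manifold
    ep = np.array([-np.sin(theta), np.cos(theta)])  # d e / d theta
    if step % 2 == 0:
        # Manifold phase: gradient step on the mean squared reconstruction
        # error R = mean(||x||^2 - (x.e)^2), so dR/dtheta = mean(-2 (x.e)(x.ep)).
        theta -= lr * np.mean(-2.0 * (x @ e) * (x @ ep))
    else:
        # Density phase: gradient step on the negative log likelihood of the
        # projected coordinates u under N(0, sigma), using
        # dNLL/dsigma = 1/sigma - mean(u^2)/sigma^3.
        u = x @ e
        sigma -= lr * (1.0 / sigma - np.mean(u**2) / sigma**3)

print(theta, sigma)  # theta approaches 0 (true manifold), sigma approaches 1
```

Because each phase touches only its own parameter, the orthogonal-manifold pathology of Figure 4 is never rewarded: the manifold phase cares only about reconstruction, the density phase only about the on-manifold likelihood.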

Adversarial training (OT).

Another option is to train manifold-modeling flows adversarially, similar to GANs or Flow-GANs (20). The loss function is then a distance metric between samples generated from the MFMF model and a batch of training samples. Such a distance metric can for instance be based on the output of a discriminator that is trained simultaneously, or on an integral probability metric such as the Wasserstein distance. We use unbiased Sinkhorn divergences, a tractable but positive definite approximation of Wasserstein divergences (21). In this training scheme, which we label OT, we iterate over the data in mini-batches, generate equally sized batches of samples from the manifold-modeling flow, and update the gradients based on the Sinkhorn divergence between the two batches.
Here the Sinkhorn divergence between two distributions is defined as the entropy-regularized optimal transport cost between them, debiased by subtracting half of the entropy-regularized self-transport cost of each distribution. It interpolates between the Wasserstein distance (in the limit of vanishing regularization) and the energy distance (in the limit of strong regularization). See Reference (21) for a detailed explanation.
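A minimal log-domain sketch of the debiased Sinkhorn divergence for uniform point clouds; the quadratic cost, regularization strength, and iteration count are illustrative choices (the GeomLoss library used in this paper provides an optimized implementation):

```python
import numpy as np

def ot_eps(xa, xb, eps=1.0, iters=200):
    """Entropy-regularized OT cost between two uniform point clouds,
    computed with log-domain Sinkhorn updates of the dual potentials."""
    C = 0.5 * np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)

    def lse_mean(M, axis):
        # numerically stable log of the mean of exp(M) along an axis
        mx = np.max(M, axis=axis, keepdims=True)
        return np.squeeze(mx, axis=axis) + np.log(np.mean(np.exp(M - mx), axis=axis))

    f, g = np.zeros(len(xa)), np.zeros(len(xb))
    for _ in range(iters):
        f = -eps * lse_mean((g[None, :] - C) / eps, axis=1)  # softmin over columns
        g = -eps * lse_mean((f[:, None] - C) / eps, axis=0)  # softmin over rows
    return np.mean(f) + np.mean(g)

def sinkhorn_divergence(xa, xb, eps=1.0):
    """Debiased divergence: S(a,b) = OT(a,b) - OT(a,a)/2 - OT(b,b)/2."""
    return ot_eps(xa, xb, eps) - 0.5 * ot_eps(xa, xa, eps) - 0.5 * ot_eps(xb, xb, eps)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(64, 2))
b = rng.normal(0.0, 1.0, size=(64, 2)) + np.array([3.0, 0.0])
print(sinkhorn_divergence(a, b))  # large for well-separated clouds
```

The debiasing makes the divergence vanish for identical clouds, which the raw entropic cost does not.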

Alternating adversarial and likelihood training (OT/D).

We can combine this adversarial training with likelihood-based phases for the base density into an alternating algorithm. It is essentially the same as the M/D algorithm described in Algorithm 1, except that in the first phase we draw samples from the model as well and optimize the parameters of both and by minimizing the loss in (24).

Geometric implicit regularization.

Given a set of data points, it is possible to train a scalar-valued neural network on the data space to learn a signed distance function from the data manifold. Its zero level set then corresponds to the manifold. Reference (22) proposes to achieve this goal by minimizing a loss that combines a term favoring the network to be zero on the data with an “Eikonal” term encouraging its gradients to be of unit norm everywhere, the two weighted relative to each other by a hyperparameter. The expectation in the Eikonal term is taken with respect to some probability distribution over the ambient data space.
This ansatz can be applied to manifold-modeling flows. One approach would be to add such a regularization term for each component of the off-the-manifold latent variables to the existing loss functions. Computing this regularization term then requires the evaluation of the Jacobian of the transformation, which is plagued by the same computational inefficiency that we discussed before. Nevertheless, the authors of Reference (22) report learned manifolds of a very high quality even for few training samples, and the computational expense may well be worth it. We leave an exploration of this idea for future work.
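To illustrate what the objective measures, the sketch below evaluates an IGR-style loss (on-data term plus weighted Eikonal term) for two analytic candidates on circle data; the finite-difference gradients, weight, and sample counts are illustrative stand-ins, not the method of Reference (22):

```python
import numpy as np

rng = np.random.default_rng(0)

def igr_loss(phi, data, ambient, lam=0.1, h=1e-4):
    """IGR-style objective (illustrative): mean |phi| on the data plus
    lam * E[(||grad phi|| - 1)^2] over ambient samples, with the
    gradient estimated by central differences in 2D."""
    on_data = np.mean(np.abs(phi(data)))
    grads = np.stack(
        [(phi(ambient + h * np.eye(2)[k]) - phi(ambient - h * np.eye(2)[k])) / (2 * h)
         for k in range(2)],
        axis=-1,
    )
    eikonal = np.mean((np.linalg.norm(grads, axis=-1) - 1.0) ** 2)
    return on_data + lam * eikonal

# Data on the unit circle; its exact signed distance function is ||x|| - 1.
angles = rng.uniform(0.0, 2.0 * np.pi, 500)
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)
ambient = rng.uniform(-2.0, 2.0, size=(2000, 2))

def sdf(x):
    return np.linalg.norm(x, axis=-1) - 1.0  # true SDF: both loss terms near zero

def not_sdf(x):
    return np.sum(x**2, axis=-1) - 1.0       # same zero set, but non-unit gradients

print(igr_loss(sdf, data, ambient), igr_loss(not_sdf, data, ambient))
```

Both candidates vanish on the data, but only the true signed distance function also satisfies the Eikonal constraint, so it attains a much smaller loss.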

3.3 Likelihood evaluation

Above we discussed training strategies that avoid a computation of the expensive terms in the likelihood. Even with such an efficient training, the model likelihood often needs to be evaluated at test time, although typically not quite as often. Here we collect ideas for how to improve the efficiency of the likelihood evaluation.

Exact likelihood.

While the model likelihood in (14) is tractable, evaluating it for typical flow transformations can be somewhat slow. The cost of this evaluation increases with the dimension of the feature space as well as with the complexity of the network architecture. In our experiments we found that this cost is not the limiting factor for low- to medium-dimensional feature spaces, even in the context of inference problems that require many repeated evaluations of the likelihood. In this work we thus restricted ourselves to exact likelihood evaluations and did not study the methods described in the following further.

Approximate likelihood.

The likelihood in (14) can be computed approximately, for instance with the methods proposed in References (17, 23, 24). Instead of computing the full Jacobian matrix, these methods only require calculating a number of matrix-vector products with randomly sampled probe vectors, which can be cheaper. Whether the gains in speed are worth the loss in precision from the approximation remains to be seen; we leave a test of this idea for future work.
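To illustrate the flavor of such stochastic estimators, here is a Hutchinson-style trace estimator that needs only matrix-vector products; a plain matrix stands in for the Jacobian-dependent linear operator, and the dimensions and probe count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hutchinson-style estimator: tr(A) = E[v^T A v] for random probe
# vectors v with E[v v^T] = I, requiring only products of A with
# vectors rather than the full matrix.
D = 6
A = rng.normal(size=(D, D))

n_probes = 20_000
v = rng.choice([-1.0, 1.0], size=(n_probes, D))  # Rademacher probes
estimate = np.mean(np.einsum("ni,ij,nj->n", v, A, v))

print(estimate, np.trace(A))  # agree up to Monte Carlo error
```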

Approximate lower bound on the likelihood.

The authors of Reference (8) derive a lower bound on the likelihood in (14). While the lower bound itself is computationally expensive, they derive a stochastic estimator for it that can be computed efficiently. Again we leave an exploration of the idea for our model for future work.

Regression on the Jacobian determinant.

The cost of evaluating the Jacobian determinant in (14) can be amortized by first evaluating this Jacobian factor for a number of representative data points and then regressing on it as a function of the data point. Afterwards, the MFMF likelihood can be evaluated efficiently at any point. We leave an investigation of this idea for future work.

Optimized architectures.

The characterization of the Jacobian evaluation in (14) as computationally expensive depends on the architecture of the transformation. In this work we only consider zero-padding followed by typical diffeomorphic transformations like coupling layers with invertible elementwise transformations or permutations; these transformations have evolved over many years of research with the design goal of efficient standard flow densities in mind. It is quite possible that a similar amount of research will unveil a class of transformations for which the terms in (14) can be computed efficiently without limiting their expressiveness. We hope that this paper can instigate research into such transformations.

4 Experiments

Figure 5: Learning a Gaussian density on a circle. Top left: true density of the data-generating process. Top middle and top right: 2D density learned by a standard ambient flow (AF) and by a PIE model. Bottom: manifold and density learned by a manifold flow with the true manifold specified (FOM), a manifold-modeling flow (MFMF–M/D), and a manifold-modeling flow trained only on the reconstruction error (MFMF–AE). To highlight the differences, we use simple, less expressive architectures (see text).

We will now demonstrate manifold-modeling flows in two pedagogical examples. We plan to follow up with experiments focused on more realistic use cases.

A common metric for flow-based models is the model log likelihood evaluated on a test set, but such a comparison is not meaningful in our context. Since the MFMF variants evaluate the likelihood after projecting to the learned manifolds, the data space is different for every model and the likelihoods of different models may not even have the same units. Instead, we analyze the performance through the generative mode, evaluating the quality of samples generated from the models with different metrics depending on the data set. In addition, we use the model likelihood for inference tasks and gauge the quality of the resulting posterior.

4.1 Gaussian on a circle

First, we want to illustrate the different flow models in a simple toy example. Data is generated on a unit circle in two-dimensional space, where the usual polar angle is drawn from a Gaussian density with fixed mean and standard deviation. To represent a slightly noisy true manifold, the radial coordinate is not set exactly to one, but drawn from a Gaussian density with mean one and a small standard deviation. As training data, we generate a set of points in this way.
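A data generator along these lines can be sketched as follows; the mean and standard deviations below are illustrative placeholders, as the exact values are not restated here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_circle(n, mu_phi=np.pi / 2, sigma_phi=0.5, sigma_r=0.01):
    """Sample points near the unit circle: Gaussian polar angle and a
    slightly noisy radius. The mean and standard deviations are
    hypothetical placeholders, not the experiment's values."""
    phi = rng.normal(mu_phi, sigma_phi, size=n)
    r = rng.normal(1.0, sigma_r, size=n)
    return np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)

x = sample_circle(10_000)
print(x.shape, np.abs(np.linalg.norm(x, axis=1) - 1.0).mean())  # close to the unit circle
```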

To highlight the differences between the models, we purposefully limit the expressivity of the flows by using simple affine coupling layers interspersed with random permutations of the latent variables. For the ambient flow we use ten affine coupling layers, while for PIE and the MFMF variants we restrict ourselves to five such layers and model the coordinate density with a Gaussian with learnable mean and variance. We also consider a FOM model, using the known parameterization of the unit circle to model the manifold and a Gaussian with learnable mean and variance for the density on it. Finally, for demonstration purposes we also include an MFMF model that is trained only on the reconstruction error, essentially an invertible autoencoder, and label it MFMF–AE. In all cases, we limit the training to 120 epochs.

Figure 5 shows the true density of the data-generating process (top left) as well as the learned densities from different models (other panels). The standard flow (AF, top middle) learns a smeared-out version of the true density, with a substantial amount of probability mass away from the true manifold. Note that the AF results become much sharper when we train until convergence or switch to a state-of-the-art architecture, as we have tested with rational-quadratic neural spline flows (25). The PIE model (top right) also learns a smeared-out version, but its inductive bias leads to a sharper version than the AF model. We also show the manifold represented by the corresponding level set in the PIE model as a dotted black line; it is not in particularly good agreement with the true manifold.

Figure 6: Mixture model on a polynomial surface. Top: the true data manifold as well as the manifolds learned by the PIE, MFMF–M/D, and MFMF–OT models. The color shows the log likelihood for a fixed value of the model parameter (bright yellow represents a high density, dark blue a low density). In order to increase the clarity of the PIE panel we have removed parts of that manifold which “fold” above and below the part shown. Bottom: ground truth and MFMF–M/D manifold for two other values of the model parameter.

In the bottom panels we show the algorithms with a model density restricted to the manifold; the black space in these figures thus shows the off-the-manifold regions, which are outside the support of the models. Note that this different support also means that the likelihood values between the top and bottom panels cannot be directly compared. The FOM model (bottom left), which requires knowledge of the manifold, perfectly captures both the shape of the manifold and the density on it. Our new MFMF–M/D algorithm (bottom middle) also parameterizes the density only on the manifold, but now the manifold is learned from data; we see both good manifold quality and good density estimation in the upper half of the circle, where most of the training data lie. In the lower part, where the density was too small to sample enough training data, the learned manifold departs from the true one. Finally, in the bottom right panel we show that training an MFMF model on the reconstruction error alone can lead to a good approximation of the manifold (where there is training data), but, of course, does not produce a reasonable density on this manifold.

4.2 Mixture model on a polynomial surface

Next, we consider a two-dimensional manifold embedded in three-dimensional space, defined by the chart (28): a rotation applied to the vector of the two latent coordinates and a polynomial in them. The rotation matrix is three-dimensional, the two latent variables parameterize the manifold, and the polynomial is specified by its coefficients and its maximal power. We choose fixed values for the rotation matrix and the polynomial coefficients for these experiments by a single random sampling from the Haar measure and from normal distributions, respectively; the values of these parameters are given in the appendix.

We define a conditional probability density on the latent variables as a mixture of two Gaussians, which together with the chart in (28) defines a probability density on the manifold. The dominant component of this mixture model is a normal distribution with a large covariance that is independent of the model parameter, while the covariance of the smaller component depends on the parameter, which is restricted to a finite range.
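A hedged sketch of such a data-generating process; the rotation matrix, polynomial coefficients, and mixture parameters below are hypothetical placeholders, not the values used in the experiments (those are given in the appendix):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 3  # maximal power of the polynomial (placeholder)
coeffs = {(i, j): rng.normal() for i in range(p + 1) for j in range(p + 1) if i + j <= p}
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random 3D rotation (up to reflection)

def chart(z):
    """Lift latent coordinates z (n, 2) to a polynomial height and rotate into R^3."""
    height = sum(c * z[:, 0]**i * z[:, 1]**j for (i, j), c in coeffs.items())
    return np.stack([z[:, 0], z[:, 1], height], axis=1) @ Q.T

# Latents from a two-component Gaussian mixture: a dominant broad
# component plus a smaller, narrower one (placeholder parameters).
n = 1000
broad = rng.random(n) < 0.7
z = np.where(broad[:, None],
             rng.normal(0.0, 2.0, size=(n, 2)),
             rng.normal(1.0, 0.3, size=(n, 2)))
x = chart(z)
print(x.shape)  # (1000, 3): samples on a 2D manifold embedded in R^3
```

Rotating the samples back with the transpose of Q recovers the latent coordinates exactly, confirming that the data lie on a two-dimensional surface.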

We train several manifold-modeling flow variants on a set of training samples and compare to AF and PIE baselines. In all cases we use rational-quadratic neural spline flows with ten coupling layers interspersed with random permutations of the features. The setup is described in detail in the appendix.

We visualize the true data manifold and the estimated manifolds from a few MFMF and PIE models in Figure 6. In the top panels we compare the ground truth and three trained models for a fixed value of the model parameter; in the bottom panels we show how the ground truth and the MFMF–M/D model change for a different parameter value. The manifold defined by the corresponding level set in the PIE model is clearly not a good approximation of the true manifold—these directions are only partially aligned with the true data manifold, and the surface defined in this way does not extend near a large part of the true data manifold at all. MFMF–OT gets some of the features of the manifold and density right, but does not perform very well in regions of low density. The results that most closely resemble the true model come from the MFMF–M/D model: not only are the learned manifold and the density on the manifold very similar to the ground truth, but the model also accurately captures the dependency of the likelihood on the model parameter.

Model–algorithm Mean distance from manifold Mean reconstruction error Posterior MMD Out-of-distribution AUC
AF 0.005 n/a 0.071 0.990
PIE 0.006 1.253 0.075 0.972
MFMF–S 0.006 0.011 0.026 0.974
MFMF–M/D (alternating) 0.002 0.003 0.020 0.986
MFMF–M/D (sequential) 0.009 0.013 0.017 0.961
MFMF–OT 0.089 0.433 0.134 0.647
MFMF–OT/D (alternating) 0.142 1.121 0.051 0.584
MFMFE–S 0.005 0.006 0.033 0.975
MFMFE–M/D (alternating) 0.003 0.003 0.030 0.985
MFMFE–M/D (sequential) 0.002 0.002 0.007 0.987
Table 2: Results for the mixture model on a polynomial surface. We compare the sample quality of the different flows as given by their distance from the true data manifold (lower is better), the reconstruction error when projecting on the learned manifold (lower is better), the maximum mean discrepancy between MCMC samples based on the model and MCMC based on the true likelihood (lower is better), and the AUC when discriminating test samples from a second out-of-distribution test set (higher is better). Out of five runs with independent training data and initializations we show the median. The best three results, which are generally consistent with each other within the variance observed in the five runs, are shown in bold.

In Table 2 we evaluate the performance of the models on four metrics:

  • We compare the quality of samples generated from the flows by calculating the mean distance from the true data manifold using (28), as described in the appendix.

  • For all models except the AF we calculate the mean reconstruction error when projecting test samples onto the learned manifold.

  • We use the flow models for approximate inference on the model parameter. We generate posterior samples with an MCMC sampler, using the likelihood of the different flow models in lieu of the true simulator density. The results are compared to posterior samples based on the true simulator likelihood. We summarize the similarity with the maximum mean discrepancy (MMD) of the posterior samples based on a Gaussian kernel (26).

  • Finally, we evaluate out-of-distribution (OOD) detection. For each model, we compare the distribution of log likelihood and reconstruction error between a normal test sample based on (29) and an OOD sample. The latter is based on the same density as the original model plus Gaussian noise with zero mean and standard deviation of 0.1 on all three features, pushing it off the data manifold of the regular training and test samples. We report the area under the curve (AUC), giving the larger number when discrimination based on both the model likelihood and the reconstruction error is available.

For each metric, we report the median based on five runs with independent training samples and weight initializations.
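The MMD metric above can be sketched as follows; the Gaussian-kernel bandwidth and sample sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy between
    two sample sets under a Gaussian kernel."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# Identical distributions give MMD^2 near zero; shifted ones do not.
a = rng.normal(0.0, 1.0, size=(500, 1))
b = rng.normal(0.0, 1.0, size=(500, 1))
c = rng.normal(2.0, 1.0, size=(500, 1))
print(mmd2(a, b), mmd2(a, c))
```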

In all metrics except out-of-distribution detection, manifold-modeling flows provide the best results. In particular, samples generated from the MFMF–M/D and MFMFE–M/D models are closest to the true data manifold and most faithfully reconstruct test samples after projecting them to the learned manifold. These algorithms, which perform comparably within the variance between the runs, clearly outperform the AF and PIE baselines when it comes to inference on the model parameter. The reconstruction error returned by these manifold-modeling flows is not quite as good as the AF log likelihood when it comes to out-of-distribution detection. The other training algorithms all have their shortcomings: MFMF–S training is not only slower, but also leads to slightly worse results, and the optimal transport variants MFMF–OT and MFMF–OT/D do not perform well on any metric, perhaps signalling the need for a more thorough tuning of hyperparameters.

5 Related work

Our work is closely related to a number of different probabilistic and generative models. We have discussed the relation to normalizing flows, autoencoders, variational autoencoders, generative adversarial networks, and energy-based models in the introduction and in Section 2. In addition, manifold learning is its own research field with a rich set of methods (27), though these typically do not model the data density on the manifold and thus do not serve quite the same purpose as the models discussed in this paper. In the following we want to draw attention to three particularly closely related works and describe how our approach differs from them.

Injective flows.

Most closely related to manifold-modeling flows are relaxed injective probability flows (8), which appeared while this paper was in its final stages of preparation. The proposed model is similar to our manifold-modeling flow with a separate encoder (MFMFE). A key difference is the way in which the invertibility of the decoder is enforced. The authors of Reference (8) bound the norm of the Jacobian of an otherwise unrestricted transformation. While this makes the transformation in principle invertible (up to the possibility of multiple points in latent space mapping to the same point in data space), the inverse of the transformation and the likelihood of this model are not tractable for unseen data points. This makes their algorithm unsuitable for inference tasks. As the authors point out, their model also cannot deal with points off the learned manifold. We address these issues by drawing from the flow literature and defining the decoder as the level set of a diffeomorphism, which is by construction exactly invertible. We also add a prescription for evaluating off-the-manifold points with a projection to the manifold, which naturally provides a measure of the distance between a data point and the manifold.

Similar to our discussion in Section 3.1, the authors of Reference (8) also argue that training an injective flow by maximum likelihood is infeasible due to the computational cost of evaluating the Jacobian of the transformation. They propose a different training objective based on a stochastic approximation of a lower bound on the likelihood, which can be computed efficiently. We point to this training strategy in our discussion in Section 3.2, but note that our alternating procedure allows us to sidestep the problem. Finally, their motivation is different from ours: while we develop MFMFs specifically to better represent the true structure of the data, they focus on the reduced computational complexity of the model due to a lower-dimensional latent space; they view the lack of support of the model off the manifold as a deficiency rather than an advantage. In addition to these qualitative differences, it would be interesting to compare relaxed injective probability flows and manifold-modeling flows quantitatively.

Pseudo-invertible autoencoder.

Another closely related model is the pseudo-invertible autoencoder (PIE) (19), which we define and discuss in Section 2 and use as a baseline in our experiments. The key difference from our MFMF setup is that the PIE model describes a density over the ambient data space, while the MFMF limits the density strictly to the manifold. In this sense the PIE approach is much more similar to a standard ambient flow, though it adds a multi-scale architecture and different base densities for the latent variables that correspond to the manifold coordinates and for the off-the-manifold latents. In addition to this fundamental difference in construction, PIE and MFMF models are trained differently: for PIE maximum likelihood is sufficient, while for MFMF we discuss the shortcomings of that objective and propose several new training schemes.

Flows on manifolds.

Finally, the MFMF is closely related to normalizing flows on prescribed manifolds (6) (FOM). In particular, the likelihood equation is almost the same, with the crucial exception that flows on manifolds require knowing a parameterization of the manifold in terms of coordinates and a chart, while the MFMF algorithm learns these from data. Since in many real-world cases the manifold is not known, MFMF models are applicable to a much larger class of problems than FOM.

Our contributions.

This paper contains four main contributions:

  1. We propose manifold-modeling flows (MFMF and MFMFE).

  2. We identify a subtlety in the naive interpretation of the density of such models and argue that they should not be trained by naive maximum likelihood alone. We address this issue with the new manifold / density (M/D) training strategy, which separates manifold and density updates. This both reduces the computational cost of the likelihood evaluation during training and avoids potential pathological configurations. We also discuss training strategies based on adversarial training and optimal transport.

  3. We demonstrate these models and training algorithms in two pedagogical examples.

  4. Beyond the newly proposed algorithms we provide a general discussion of the relation between different generative models and the data manifold, reviewing ambient flows, injective flows, flows on manifolds, PIEs, VAEs, and GANs in a common language. In particular, we identify an inconsistency between training and data generation for PIE models.

6 Conclusions

In this work we introduced manifold-modeling flows (MFMFs), a new type of generative model that combines aspects of normalizing flows, autoencoders, and GANs. MFMFs describe data as a probability density over a lower-dimensional manifold embedded in data space. Unlike flows on prescribed manifolds, they learn a chart for the manifold from the training data. MFMFs allow generating data in a similar way to GANs while maintaining a tractable exact density. They also provide a prescription for evaluating points off the manifold by first projecting data onto the manifold. The MFMF approach may not only represent data sets with manifold structure more accurately, but also allow us to use lower-dimensional latent spaces than with ambient flows, reducing the memory and computational footprint. As an added benefit, the projection to the manifold may be useful for denoising or to detect out-of-distribution samples. We introduced two variants of this new model, one of which features a separate encoder while the other uses the inverse of the decoder directly, and broadly reviewed the relation between several types of generative models and the structure of the data manifold.

Despite the tractable density, training MFMF models is nontrivial: any update of the manifold modifies the variables that the density describes, rendering training by naive maximum likelihood invalid. In addition, computing the full model likelihood can be expensive. We reviewed several training and evaluation strategies that mitigate these problems. In particular, we introduced the new M/D training schedule, which separates manifold and density updates and addresses both the stability and the cost issues. We also presented an adversarial training scheme based on optimal transport as well as a hybrid version that alternates between adversarial phases and density updates.

In two pedagogical experiments, we demonstrated how this approach lets us learn the data manifold and a probability density on it. MFMF models allowed us to reconstruct the data manifold with a substantially higher quality than PIE and outperformed ambient flow and PIE baselines on a downstream inference task. Our experiments were so far limited to problems with a low dimensionality. While training manifold-modeling flows on high-dimensional data sets such as high-resolution images is straightforward and the generative mode remains efficient, inference tasks will become more challenging as the exact likelihood evaluation becomes increasingly expensive. In this paper we have laid out multiple strategies that can help mitigate this cost.

Problems in which data populate a lower-dimensional manifold embedded in a high-dimensional feature space are ubiquitous. In some scientific cases, domain knowledge allows for exact statements about the dimensionality of the data manifold, and MFMFs can be a particularly powerful tool in a likelihood-free or simulation-based inference setting (28). Even in the absence of such domain-specific insight this approach may be valuable: GANs with low-dimensional latent spaces are powerful generative models for numerous data sets of natural images, which is a testament to the presence of a low-dimensional data manifold. Flows that simultaneously learn the data manifold and a tractable density over it may help us unify generative and inference tasks in a way that is well suited to the structure of the data.


We would like to thank Jens Behrmann, Jean Feydy, Michael Kagan, George Papamakarios, Merle Reinhart, Frank Rösler, John Tamanas, and Andrew Wilson for useful discussions. We are grateful to Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios for publishing their excellent neural spline flow codebase (25), which we used extensively in our analysis. Similarly, we want to thank George Papamakarios, David Sterratt, and Iain Murray for publishing their Sequential Neural Likelihood code (29), parts of which were used in the evaluation steps in our experiments. We are grateful to the authors and maintainers of Delphes 3 (30), GeomLoss (21), Jupyter (31), MadMiner (32), Matplotlib (33), NumPy (34), Pythia8 (35), PyTorch (36), scikit-learn (37), and SciPy (38). This work was supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471; by the Moore-Sloan data science environment at NYU; and through the NYU IT High Performance Computing resources, services, and staff expertise.


Appendix A Experiment details

In our second experiment, the manifold is defined by (28). We use the randomly drawn polynomial coefficients


and the rotation matrix


For the training data set we draw parameter points from a uniform prior, , while for the test set we generate data for .

We implement all generative models as rational-quadratic neural spline flows with coupling layers alternating with random permutations (25). For standard flows we use ten coupling layers; for the PIE, MFMF, and MFMFE models we use five layers for the transformation (which also defines the manifold through a level set) and five layers for the . For the PIE model we use an off-the-manifold base density with standard deviation . In each coupling transform, half of the inputs are elementwise transformed with a monotonic rational-quadratic spline, whose parameters are determined by a residual network with two residual blocks of two hidden layers each, 100 units in each layer, and activations throughout. We do not use batch normalization or dropout, since we found that the stochasticity they induce can lead to issues with the invertibility of the transformations. The splines are constructed in ten bins per variable, distributed over the range .
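The coupling structure described above can be sketched as follows, substituting an affine elementwise transform for the monotonic rational-quadratic spline and a fixed toy conditioner for the residual network; the split into a pass-through half and a transformed half, the alternation with random permutations, and the exact invertibility are the same as in the models used here.

```python
import numpy as np

def affine_coupling(x, conditioner, forward=True):
    """Coupling transform: the first half of the inputs passes through
    unchanged and parameterizes an elementwise transform of the second
    half. (An affine transform is substituted here for the monotonic
    rational-quadratic spline used in the paper.)"""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    log_s, t = conditioner(x1)          # residual network in the paper
    if forward:
        y2 = x2 * np.exp(log_s) + t
    else:
        y2 = (x2 - t) * np.exp(-log_s)  # exact inverse, elementwise
    return np.concatenate([x1, y2], axis=-1)

# A fixed toy conditioner standing in for the learned residual network:
def conditioner(x1):
    return np.tanh(x1), x1 ** 2         # (log-scale, shift)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 4))
perm = rng.permutation(4)
inv_perm = np.argsort(perm)

# Forward pass: coupling layer followed by a random permutation.
y = affine_coupling(x, conditioner)[..., perm]
# Inverse pass: undo the permutation, then invert the coupling layer.
x_rec = affine_coupling(y[..., inv_perm], conditioner, forward=False)
```

The permutations ensure that, over several layers, every input dimension eventually lands in the transformed half.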
All models are trained with the Adam optimizer, with an initial learning rate of and cosine annealing, and a weight decay of . To balance the sizes of the various terms in the loss functions, we multiply them with different weights. For the manifold phase of the M/D training, we weight the mean reconstruction error with a factor . In the S training we use the mean negative log likelihood weighted with a factor of plus the mean reconstruction error weighted with a factor of . For OT training we multiply the Sinkhorn divergence (defined with ) by 10. We train for 50 epochs with a batch size of 100 (1000 for the OT training). We study sequential as well as alternating versions of the M/D algorithm, where in the latter case we alternate between the training phases after every epoch. We save the weights after each epoch and use the set of weights that leads to the smallest validation loss.

We evaluate generated samples by undoing the rotation, , and evaluating the distance in direction to the manifold as . For the inference task we use a Metropolis-Hastings MCMC sampler based on the different flow likelihoods. We consider a synthetic “observed” data set of 10 i.i.d. samples generated for . For each model, we generate an MCMC chain of length 5000, with a Gaussian proposal distribution with mean step size 0.15 and a burn-in of 100 steps.
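The sampler itself is standard random-walk Metropolis-Hastings; a minimal sketch with the settings quoted above (chain length 5000, step size 0.15, burn-in 100), using a toy standard-normal log-likelihood in place of the flow likelihoods:

```python
import numpy as np

def metropolis_hastings(log_like, theta0, n_steps=5000, step=0.15,
                        burn_in=100, seed=0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    logp = log_like(theta)
    chain = []
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=theta.shape)
        logp_prop = log_like(prop)
        # Accept with probability min(1, p(prop) / p(theta)).
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        chain.append(theta.copy())
    return np.array(chain[burn_in:])

# Toy stand-in for a flow log-likelihood: a standard normal posterior.
chain = metropolis_hastings(lambda t: -0.5 * float(t @ t), np.zeros(1))
```

In the experiments, `log_like` would instead sum the model log likelihood over the 10 observed samples at each proposed parameter point.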