manifold-flow
Manifold-learning flows (ℳ-flows)
We introduce manifold-modeling flows (MFMFs), a new class of generative models that simultaneously learn the data manifold as well as a tractable probability density on that manifold. Combining aspects of normalizing flows, GANs, autoencoders, and energy-based models, they have the potential to represent data sets with a manifold structure more faithfully and provide handles on dimensionality reduction, denoising, and out-of-distribution detection. We argue why such models should not be trained by maximum likelihood alone and present a new training algorithm that separates manifold and density updates. With two pedagogical examples we demonstrate how manifold-modeling flows let us learn the data manifold and allow for better inference than standard flows in the ambient data space.
Inferring a probabilistic model from some example data points is a common problem that is increasingly often tackled with deep generative models. Generative adversarial networks (GANs) (1) and variational autoencoders (VAEs) (2) are both based on a lower-dimensional latent space and a learnable mapping from that latent space to the data space. In essence, these models describe a lower-dimensional data manifold embedded in the data space. While they allow for efficient sampling, their probability density (or likelihood) is intractable, leading to a challenge for training and limiting their usefulness for inference tasks. On the other hand, normalizing flows (3, 4, 5) are based on a latent space with the same dimensionality as the data space and a diffeomorphism; their tractable density has support over the full data space and is not restricted to a lower-dimensional surface.
The flow approach may be unsuited to data points that do not populate the full feature space they are parameterized in, but are restricted to a lower-dimensional data manifold. Normalizing flows are by construction not able to represent such a data structure exactly, instead they learn a smeared-out version with support off the data manifold. We illustrate this in the left panel of Figure 1, where the black dots represent 2D data populating a 1D manifold and the orange surface sketches the density learned by a normalizing flow. In addition, the requirement of latent spaces with the same dimension as the data space increases the memory footprint and computational cost of the model. While flows have been generalized from Euclidean feature spaces to Riemannian manifolds (6), this approach has so far been limited to the case where the chart for the manifold is prescribed.
Here we introduce manifold-modeling flows (MFMF): normalizing flows based on an injective, invertible map from a lower-dimensional latent space to the data space. MFMFs simultaneously learn the shape of the data manifold, provide a tractable bijective chart, and learn a probability density over the manifold, as sketched in the right panel of Figure 1. When evaluating the model, the input (which may be off the manifold) is first projected onto the manifold and the model returns both the distance from the manifold as well as the density on the manifold after the projection.
The MFMF approach marries aspects of normalizing flows, GANs, and autoencoders. Compared to flows on prescribed manifolds, this approach relaxes the requirement of knowing a closed-form expression for the chart from latent variables to the data manifold and instead learns the manifold from data. In contrast to GANs and VAEs, it not only provides an exact tractable likelihood over the data manifold, but also a prescription for how to treat points off the manifold. In contrast to standard autoencoders, it is a probabilistic model with a generative mode and tractable density. Similar to invertible autoencoders (7), MFMFs ensure that for data points on the manifold the encoder and decoder are the inverse of each other. They can also be seen as regularized autoencoders (8, 9). Compared to standard flow-based generative models, MFMFs offer several advantages:
Manifold-modeling flows may more accurately approximate the true data distribution, avoiding probability mass off the data manifold. This in turn could lead to performance gains in inference and generative tasks.
The model architecture naturally allows one to model a conditional density that lives on a fixed manifold. This should improve data efficiency in such situations, as the manifold is ingrained in the architecture and does not need to be learned.
The lower-dimensional latent space reduces the complexity of the model, allowing us to use more expressive transformations or scale to higher-dimensional data spaces within a given computational budget.
The projection onto the data manifold provides dimensionality reduction and denoising capabilities. The distance to the manifold may also be useful to detect out-of-distribution samples.
The MFMF model embraces the idea of energy-based models (10, 11, 12) for dealing with off-the-manifold issues through a non-probabilistic distance measure, while retaining a tractable density on the data manifold. Similarly, we can link it to the development of adversarial objectives for GANs: the original GAN setup (1), in which a generator is pitted against a discriminator, corresponds to training based on a proxy for the likelihood ratio. Off the data manifold this density ratio is not well-defined, which makes the training challenging. Wasserstein GANs (13) address this issue by measuring distances between two data manifolds in feature space. Similarly, the likelihood of normalizing flows is not appropriate when data populates a lower-dimensional manifold; the MFMF model augments flows with a distance measure in feature space to measure closeness to the data manifold.
Training an MFMF model faces two challenges. First, maximum likelihood is not enough: we will demonstrate that the training dynamics for naive likelihood-based training may not lead to a good estimate of the manifold and the density on it. Second, evaluating the MFMF density can be computationally expensive. We will discuss several new training strategies that solve these challenges. In particular, we introduce a new training scheme with separate manifold and density updates, which allows for a computationally efficient training of MFMF models and incentivizes both good manifold quality and good density estimation on the manifold.
We begin with a broad discussion of the notion of data manifolds in different generative models and introduce manifold-modeling flows in Section 2. In Section 3 we discuss pitfalls when training MFMF models and introduce training strategies that can overcome these challenges. In Section 4 we demonstrate MFMF in some experiments. We comment on related work in Section 5 before summarizing the results in Section 6. The code used in our study is available at http://github.com/johannbrehmer/manifold-flow.
Consider a true data-generating process that draws samples according to x ~ p*(x), where the support of p* is an n-dimensional Riemannian manifold ℳ* embedded in the d-dimensional data space X and n < d. We consider the two problems of estimating the density p*(x) as well as the manifold ℳ* given some training samples {x_i} ~ p*(x). We will later extend our models with a projection to the manifold so that they can also handle problems in which the data is only approximately restricted to a manifold.
In the following we will discuss which types of generative models address which parts of this problem and generally discuss the relation between the data manifold and various classes of generative models. In the process we will also introduce the new manifold-modeling flows (MFMF). We distinguish between three different classes of models:
manifold-free models defined in the ambient space X = ℝ^d,
models for an explicitly prescribed manifold, and
models that learn an unknown manifold.
In this discussion we rely on a few simplifying assumptions. We treat the manifold as topologically equivalent to ℝ^n; in particular, we assume that it is connected and can be described by a single chart. We also assume that the dimensionality n of the manifold is known. In Section 2.2.4 we will discuss how these requirements can be lifted.
To facilitate a straightforward comparison, we will describe all generative models in terms of two vectors of latent variables
u ∈ U = ℝ^n and v ∈ V = ℝ^(d−n), where U is the latent space that maps to the learned manifold ℳ, i.e. u are the coordinates on the manifold. v parameterizes any remaining latent variables, representing the directions "off the manifold". In Figure 2 we sketch the setup of the different models. In Table 1 we summarize some of their properties.
In these conventions, a standard Euclidean normalizing flow (5) in the ambient data space is a diffeomorphism f: U × V → X with

x = f(u, v),   (1)

together with a tractable base density p_{uv}(u, v) (such as a multivariate unit Gaussian). According to the change-of-variable formula, the density in data space is then given by

p_x(x) = p_{uv}(f^{-1}(x)) |det J_f(f^{-1}(x))|^{-1},   (2)

where J_f = ∂f/∂(u, v) is the Jacobian of f, a d × d matrix.
f is usually implemented as a neural network with certain constraints that make det J_f efficient to compute. In the generative mode, flows sample u and v from their base densities and apply the transformation f, leading to samples x = f(u, v). There is typically no difference between the treatment of u and v. While some models employ multi-scale architectures where some latent variables have more transformations applied to them than others (4), there is no explicit incentive for the network to align these directions in the latent space with coordinates on the data manifold and off-the-manifold directions, respectively. This model therefore has no notion of a data manifold; it only describes regions of varying probability density in the overall ambient feature space. We will therefore refer to it as an ambient flow (AF).
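As a minimal numerical sketch of the change-of-variable formula in (2), consider a "flow" consisting of a single affine layer. The affine map, its parameters, and the function names below are illustrative stand-ins of ours, not the architecture used in the paper, but the density formula is the same one that real flows evaluate layer by layer.

```python
import numpy as np

# A toy ambient flow: one invertible affine layer x = A z + b in d = 2.
# Real flows stack many learnable invertible layers, but the
# change-of-variable density formula is identical.
A = np.array([[2.0, 0.5], [0.0, 1.5]])   # invertible d x d transformation
b = np.array([1.0, -1.0])

def base_log_density(z):
    # standard Gaussian base density over the latent variables
    return -0.5 * z @ z - z.size / 2 * np.log(2 * np.pi)

def flow_log_density(x):
    # log p_x(x) = log p_z(f^{-1}(x)) - log |det J_f|
    z = np.linalg.solve(A, x - b)        # f^{-1}(x)
    log_det = np.log(abs(np.linalg.det(A)))
    return base_log_density(z) - log_det

x = np.array([0.3, 0.7])
print(flow_log_density(x))
```

For this affine case the model density is exactly a Gaussian N(b, A Aᵀ), which gives an independent check of the formula.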
When a chart (or an atlas of multiple charts) for the manifold is known a priori, one can construct a flow on this manifold (6). If a diffeomorphism

g: U → ℳ ⊂ X, u ↦ x = g(u)   (3)

is the sole chart for the manifold, the density on the manifold is given by

p_ℳ(x) = p_u(g^{-1}(x)) |det[J_g^T(g^{-1}(x)) J_g(g^{-1}(x))]|^{-1/2},   (4)

where J_g = ∂g/∂u is the Jacobian of g, a d × n matrix. The latent variables u are the coordinates on the manifold. The density p_u(u) in this coordinate space can then be modeled with a regular normalizing flow in n dimensions, i.e. a learnable diffeomorphic transformation

u = h(ũ)   (5)

and a base density p_ũ(ũ). Then

p_u(u) = p_ũ(h^{-1}(u)) |det J_h(h^{-1}(u))|^{-1},   (6)

where J_h = ∂h/∂ũ is the Jacobian of h.
Sampling from such a flow is straightforward and consists of drawing ũ from the base density and transforming the variable with h and then g. Depending on the choice of chart, the model likelihood in (4) can be evaluated efficiently, and the model is by construction limited to the true manifold. This approach has been worked out for spheres and tori of arbitrary dimension (14), for hyperbolic manifolds (15), as well as for a problem in theoretical physics where the manifold consists of a particular product of groups (16).
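To make (4) concrete, the sketch below evaluates a density on a prescribed one-dimensional manifold, an ellipse in ℝ². Both the chart and the coordinate density (a von Mises density over the chart coordinate) are toy choices of ours, not taken from the cited references; the point is only how the Jacobian factor converts a coordinate density into a density per unit arc length.

```python
import numpy as np

# Flow on a prescribed manifold: an ellipse embedded in R^2.
# g maps the coordinate u in [0, 2*pi) to data space; the factor
# |det(J^T J)|^{-1/2} in Eq. (4) converts the coordinate density
# p_u(u) into a density per unit arc length on the manifold.
a, b = 2.0, 1.0

def g(u):
    return np.array([a * np.cos(u), b * np.sin(u)])

def jac_factor(u):
    # J_g is a 2x1 matrix; sqrt(det(J^T J)) equals |g'(u)|
    J = np.array([[-a * np.sin(u)], [b * np.cos(u)]])
    return np.sqrt(np.linalg.det(J.T @ J))

def coord_density(u):
    # toy choice: von Mises density over the chart coordinate
    return np.exp(np.cos(u)) / (2 * np.pi * np.i0(1.0))

def manifold_density(u):
    # p_M(g(u)) = p_u(u) / sqrt(det(J^T J))  -- Eq. (4)
    return coord_density(u) / jac_factor(u)

# integrating the manifold density against the arc-length element
# recovers the total probability mass
us = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)
du = us[1] - us[0]
mass = float(sum(manifold_density(u) * jac_factor(u) for u in us) * du)
print(mass)  # ≈ 1
```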
GANs map an n-dimensional latent space to the data space,

x = g(u).   (7)

Here g is a learnable map such as a deep neural network rather than a prescribed closed-form chart. This map is restricted neither to be injective nor to be invertible: there can be multiple u that correspond to the same data point x. Therefore g is not a chart, and the image of this transformation is not necessarily a Riemannian manifold, though in practice this distinction is not relevant and we will simply call this subset a manifold.
While the lack of restrictions on g increases the expressivity of the network, it also makes the model density intractable. This drawback has two immediate consequences. First, GANs have to be trained adversarially as opposed to by maximum likelihood. Second, despite their built-in manifold-like structure, GANs are neither well suited for inference tasks that require evaluating the model density nor for manifold learning tasks.¹ Finally, note that in conditional GANs both the shape of the manifold and the implicit density on it generally depend on the variables being conditioned on.²

¹ Reference (17) introduces a method that allows the GAN density to be calculated at least approximately, though this approach neglects the possibility of multiple u pointing to the same x. PresGANs (18) add a noise term to the generative procedure, similar to a VAE, as well as a numerical method to evaluate the model density approximately using importance sampling.

² To fix the manifold but let the density on it be conditional, one could make g independent of the conditioning variables and model p_u(u) with a conditional density estimator such as a normalizing flow. Such a partially conditional GAN setup has, to the best of our knowledge, not yet been explored in the literature.
Variational autoencoders also map a lower-dimensional latent space to the data space, but instead of a deterministic function they use a stochastic decoder p(x | u). The marginal density

p(x) = ∫ du p(x | u) p(u)   (8)

of the model therefore extends off the manifold into the whole space X. This marginal density itself is intractable, though there is a variational lower bound (the ELBO) for it that is commonly used as a training objective.
Nevertheless, the lower-dimensional latent space of a VAE is often associated with a learned data manifold. Often only the final step in the decoder is stochastic, for instance as a Gaussian density in data space whose mean is a learned function of the latent variables. One can then define an alternative generative mode by using this mean instead of sampling from the Gaussian, replacing the stochastic decoder with a deterministic one. In this way the generated samples are restricted to a lower-dimensional subset of X. While not strictly a manifold, for all practical purposes it is equivalent to one. However, generating in this mode does not correspond to sampling from the marginal density p(x), which was used to train the model.
Model | Manifold | Chart | Generative mode | Tractable density | Restricted to manifold
---|---|---|---|---|---
Ambient flow (AF) | no manifold | — | ✓ | ✓ | ✗
Flow on manifold (FOM) | prescribed | prescribed | ✓ | ✓ | ✓
Generative adversarial network (GAN) | learned | ✗ | ✓ | ✗ | ✓
Variational autoencoder (VAE) | learned | ✗ | ✓ | only ELBO | (✓)
Pseudo-invertible encoder (PIE) | learned | ✓ | ✓ | ✓ | (✓)
Slice of PIE | learned | ✓ | ✓ | up to normalization | ✓
Manifold-modeling flow (MFMF) | learned | ✓ | ✓ | ✓ (may be slow) | ✓
Manifold-modeling flow with sep. encoder (MFMFE) | learned | ✓ | ✓ | ✓ (may be slow) | ✓
One way to give ambient flows a notion of a (learnable) manifold is to treat some of the latent variables differently from others and rely on the training to align one class of latent variables with the manifold coordinates and the other class with the off-the-manifold directions. This is the essential idea behind the Pseudo-Invertible Encoder (PIE) architecture (19). Its basic setup is given by the flow transformation of (1) and the flow density in (2). The key difference is that in PIE one chooses different base densities for the latent variables u, which are designated to represent the coordinates on the manifold, and v, which should learn the off-the-manifold directions in latent space: the base density p_u(u) is modeled with an n-dimensional Euclidean flow, i.e. a transformation h that maps it to another latent variable ũ associated with a standard base density such as a unit Gaussian. The off-the-manifold base density p_v(v) is chosen such that it sharply peaks around v = 0, for instance as a Gaussian with a small variance ε² in each direction.

For sufficiently flexible transformations, this architecture has the same expressivity as an ambient flow, independent of the orientation of the latent space. In particular, a single scaling layer can learn to absorb the difference in base densities, allowing the flow to squeeze any region of data space into the narrow base density and thus fit the data equally well independent of how the latent variables u and v are aligned with the data manifold. From that perspective PIE does not seem to be a different model from AF. Yet in practice the learning dynamics and the inductive bias of the model seem to couple in a way that favors an alignment of the level set v = 0 with the data manifold. Understanding these dynamics better would be an interesting research goal.
In many ways, PIE walks and quacks like an ambient flow. In particular, the model density in (2) generally has support over the full data space X, extending beyond the manifold. To sample from this density, one would still draw u and v and apply the transformation f.
However, the labelling of different latent directions as manifold coordinates and off-the-manifold directions gives us some new handles. The authors of Reference (19) define a generative mode that samples data only from the learned manifold: one samples u as usual, but fixes v = 0, and then applies the transformation f. This is similar to sampling from the learned manifold of a VAE when the Gaussian mean is used as a deterministic decoder. If the inductive bias of the PIE model successfully leads to an alignment of u with the manifold coordinates, this allows us to sample only from the manifold. Note, however, that the density defined by this sampling procedure is not the same as the tractable density in (2).³ This combination of a manifold sampling mode with an ambient tractable density is therefore inconsistent. Finally, note that the hyperparameter ε allows us to smoothly interpolate between an ambient flow (ε → 1) and "manifolds" (ε → 0).

³ Sampling with v = 0 corresponds to the density in (14), not to the one in (2). Even when restricted to the manifold, these two densities need not even be proportional to each other: writing the Jacobian of f in column notation as J_f = (J_u J_v), the ambient density at f(u, 0) scales with |det J_f|^{-1}, while the density induced by the sampling procedure scales with |det[J_u^T J_u]|^{-1/2}, and these factors generally differ.

The PIE architecture defines a density over the full data space, and the level set v = 0 defines a manifold ℳ. It may therefore be tempting to study the density on ℳ induced by p_x, which is defined as
p_ℳ(x) = p_x(x) / ∫_ℳ dx' p_x(x')   for x ∈ ℳ.   (9)

While the normalizing integral in (9) cannot be computed efficiently, with (2) we can compute the numerator p_x(x) easily enough, so this likelihood is tractable up to an unknown normalizing constant. Depending on the task, this may or may not be sufficient.
We now introduce the main new algorithm of this paper: the manifold-modeling flow (MFMF). It combines the learnable manifold aspect of GANs with the tractable density of flows on manifolds (FOM) without introducing inconsistencies between the generative mode and the tractable likelihood. We begin by modeling the relation between the latent space and the data space with a diffeomorphism

f: U × V → X, (u, v) ↦ x = f(u, v),   (10)

just as for an ambient flow or PIE. We define the model manifold through the level set

ℳ = { f(u, 0) | u ∈ U }.   (11)
In practice, we implement this transformation as a zero padding followed by a series of invertible transformations,

g = f ∘ Pad = f_k ∘ ⋯ ∘ f_1 ∘ Pad,   (12)

where

Pad(u) = (u_1, …, u_n, 0, …, 0)^T   (13)

denotes padding an n-dimensional vector with d − n zeros and the invertible transformations f_i operate in d-dimensional space. Viewed as a map from the latent space U to the data space X, the transformation g is injective and (when restricted to its image ℳ) invertible.
Just as for FOM and PIE, we model the base density p_u(u) with an n-dimensional flow h, which maps u to another latent variable ũ with an associated tractable base density p_ũ(ũ). There is no need for a base density over the off-the-manifold variables v in this approach. The induced probability density on the manifold is then given by

p_ℳ(x) = p_u(g^{-1}(x)) |det[J_g^T(g^{-1}(x)) J_g(g^{-1}(x))]|^{-1/2}.   (14)

This is the same as (4), except with a learnable transformation g rather than a prescribed, closed-form chart. This model density is defined only on the manifold and normalized over it, ∫_ℳ dx p_ℳ(x) = 1.
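The density on the learned manifold can be sketched numerically with a linear toy "flow": a single invertible affine layer plays the role of f, and the chart is g(u) = f(Pad(u)). All names and the affine choice are illustrative assumptions of ours, not the paper's implementation; for this linear g the Jacobian is constant and the Gram determinant det(J_gᵀ J_g) can be checked against elementary geometry.

```python
import numpy as np

# Linear toy version of the MFMF density: f is one affine layer,
# Pad appends d - n zeros, and g(u) = f(Pad(u)) maps R^n into R^d.
rng = np.random.default_rng(1)
d, n = 3, 2
A = rng.normal(size=(d, d)) + 3 * np.eye(d)   # invertible d x d "flow"
c = rng.normal(size=d)

def g(u):
    return A @ np.concatenate([u, np.zeros(d - n)]) + c

def log_manifold_density(u):
    # log p_M(x) = log p_u(u) - 1/2 log det(J_g^T J_g),
    # with a unit Gaussian standing in for the coordinate density p_u
    J = A[:, :n]                               # Jacobian of g: d x n
    half_logdet = 0.5 * np.linalg.slogdet(J.T @ J)[1]
    log_pu = -0.5 * u @ u - n / 2 * np.log(2 * np.pi)
    return log_pu - half_logdet

print(log_manifold_density(np.array([0.5, -0.3])))
```

For n = 2 and d = 3 the factor sqrt(det(J_gᵀ J_g)) is exactly the area of the parallelogram spanned by the two Jacobian columns, i.e. the norm of their cross product, which makes the Gram-determinant identity easy to verify.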
Sampling from an MFMF is straightforward: one draws ũ ~ p_ũ(ũ) and pushes the latent variable forward to the data space as u = h(ũ) followed by x = g(u), leading to data points on the manifold that consistently follow p_ℳ(x).
As a final ingredient to the MFMF approach, we add a prescription for evaluating arbitrary points x ∈ X, which may be off the manifold. As we illustrate in Figure 3, g maps from a low-dimensional latent space to the data space and is thus essentially a decoder. We define a matching encoder as f^{-1} followed by a projection to the u component:

g^{-1} = Proj ∘ f^{-1},   (15)

with Proj(u, v) = u. This extends the inverse of g (which is so far only defined for x ∈ ℳ) to the whole data space X. Similar to an autoencoder, combining g^{-1} and g allows us to calculate a reconstruction error

‖x − g(g^{-1}(x))‖,   (16)

which is zero if and only if x ∈ ℳ. Unlike for standard autoencoders, the encoder and decoder are exact inverses of each other as long as points on the manifold are studied.
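This encoder-decoder pair can be sketched with the same linear toy construction as above (an affine layer standing in for the flow f; all names are illustrative assumptions): points on the manifold reconstruct exactly, while off-manifold points are projected and pick up a nonzero reconstruction error.

```python
import numpy as np

# Toy MFMF encoder/decoder: decode = g = f ∘ Pad with an affine f,
# encode = Proj ∘ f^{-1} keeps only the first n latent coordinates.
d, n = 3, 2
rng = np.random.default_rng(2)
A = rng.normal(size=(d, d)) + 3 * np.eye(d)
c = rng.normal(size=d)

def decode(u):
    return A @ np.concatenate([u, np.zeros(d - n)]) + c

def encode(x):
    return np.linalg.solve(A, x - c)[:n]

def reconstruction_error(x):
    return np.linalg.norm(x - decode(encode(x)))

u = rng.normal(size=n)
on_manifold = decode(u)
off_manifold = on_manifold + 0.1 * A[:, -1]   # push along an off-manifold direction
print(reconstruction_error(on_manifold))       # ≈ 0
print(reconstruction_error(off_manifold))      # > 0
```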
For an arbitrary x ∈ X, an MFMF thus lets us compute three quantities:

The projection x' = g(g^{-1}(x)) onto the manifold, which may be used as a denoised version of the input.

The reconstruction error ‖x − x'‖, which will be important for training, but may also be useful for anomaly detection or out-of-distribution detection.

The likelihood on the manifold after the projection, p_ℳ(x').
In this way, MFMFs separate the distance from the data manifold and the density on the manifold—two concepts that easily get conflated in an ambient flow. MFMFs embrace ideas of energy-based models for dealing with off-the-manifold issues, but still have a tractable, exact likelihood on the learned data manifold. Figure 3 summarizes how an MFMF model evaluates a data point by transforming to the latent space, projecting onto the manifold (where the density is evaluated), and transforming back to data space (where the reconstruction error is calculated).
Finally, we introduce a variant of the MFMF model where, instead of using the inverse f^{-1} followed by a projection as an encoder, we encode the data with a separate learnable function

u = e(x).   (17)

This encoder is not restricted to be invertible or to have a tractable Jacobian, potentially increasing the expressiveness of the network. Just as in the MFMF approach, for a given data point x, the MFMFE model returns a point projected onto the learned manifold, a reconstruction error, and the likelihood on the manifold evaluated after the projection:

x' = g(e(x)), ‖x − x'‖, and p_ℳ(x') = p_u(e(x)) |det[J_g^T(e(x)) J_g(e(x))]|^{-1/2}.   (18)

The added expressivity of this encoder comes at the price of potential inconsistencies between encoder and decoder, which the training procedure will have to penalize, exactly as for standard autoencoders and similar to VAEs.
So far we have made two key assumptions to simplify the learning problem: that we know the manifold dimensionality n and that the manifold is topologically equivalent to ℝ^n (in particular, that it can be mapped by a single chart). The algorithms presented above can be extended to the more general case where these assumptions are relaxed.

If the dimension n of the manifold is not known, a brute-force solution would be to scan over values of n and train algorithms for each value. A common metric for flow-based models is the model log likelihood evaluated on a number of test samples, but that criterion is not admissible in this context since the space of the data (and the units of the likelihood) are different for different values of n. However, we can compare models with different manifold dimensionality based on the reconstruction error, as well as on downstream tasks such as evaluating the quality of generated samples or the performance on inference tasks. A drop in performance is expected when the model manifold dimension becomes smaller than the true one.
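The dimension scan can be illustrated with a linear toy example in which PCA plays the role of the trained manifold model; this is only a cartoon of the procedure (our own construction, not the MFMF pipeline), but it shows the signature one looks for: the reconstruction error is large below the true dimension and drops to essentially zero at and above it.

```python
import numpy as np

# Data on a true 2D linear "manifold" embedded in R^5. For each
# candidate dimension n we fit a rank-n linear approximation (PCA
# standing in for a trained manifold model) and record the error.
rng = np.random.default_rng(3)
true_n, d, N = 2, 5, 500
basis = np.linalg.qr(rng.normal(size=(d, true_n)))[0]
data = rng.normal(size=(N, true_n)) @ basis.T

_, _, Vt = np.linalg.svd(data, full_matrices=False)
errors = {}
for n in range(1, d + 1):
    recon = data @ Vt[:n].T @ Vt[:n]          # rank-n reconstruction
    errors[n] = float(np.sqrt(np.mean((data - recon) ** 2)))
print(errors)
```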
Alternatively, for the PIE algorithm one could use trainable values of the base density variance along each latent direction, with suitable regularization favoring values close to 0 or 1. In this way the model can learn the manifold dimensionality directly from the training data.
If the manifold consists of multiple disjoint pieces, potentially with different dimensionality, a mixture model with separate transformations from latent space to data space may work. It remains to be seen if such a model is easy to train. See Reference (14) for a discussion of such issues.
Having defined the MFMF model, we now turn to the question of how to train it. Most flow-based generative models are trained by maximum likelihood, with architectures commonly designed with the goal of making the likelihood in (2) efficient to evaluate. For implicit generative models that is not available: GANs are trained adversarially, for instance pitted against a discriminator or using an optimal transport (OT) metric, while VAEs are commonly trained on a lower bound for the marginal likelihood (the ELBO). We will draw on all of these approaches, beginning with a discussion of two challenges of likelihood-based training for MFMFs in Section 3.1. We discuss a number of more promising training strategies in Section 3.2, before commenting on steps to also make the evaluation of the likelihood more efficient in Section 3.3.
Since the MFMF model has a tractable density, maximum likelihood is an obvious candidate for a training objective. However, the situation is more subtle, as the MFMF model describes the density after projecting onto the learned manifold. The definition of the data variable in the likelihood hence depends on the weights of the manifold-defining transformation f, and a comparison of naive likelihood values between different configurations of f is meaningless. Instead of a single likelihood function, it is instructive to think of a family of likelihood functions, one for each configuration of the manifold-defining transformation f.
Training MFMFs by simply maximizing the naive likelihood is therefore not meaningful, does not incentivize the network to learn the right shape of the manifold, and probably will not converge to the true model. As an extreme example, consider a model manifold that is perpendicular to the true data manifold. Since this configuration allows the MFMF to project all points to a region of very high density on the model manifold, this pathological configuration may lead to a very high naive likelihood value.
We demonstrate this issue with a simple toy problem in Figure 4. The feature space is two-dimensional, and the model manifold consists of a line through the origin with variable angle θ, such that θ = 0 corresponds to a manifold aligned with the horizontal axis and θ = π/2 to a manifold aligned with the vertical axis. On this line we consider a one-dimensional Gaussian probability density with mean at the origin and standard deviation σ. Training samples are generated from the configuration θ = 0 and σ = 1. The setup is sketched in Figure 3(a). In the top left panel of Figure 3(b) we show how the naive likelihood of this model over the training data depends on the parameters θ and σ. When fixing the manifold to the true value θ = 0, the correct standard deviation σ = 1 indeed maximizes the naive likelihood. However, the model can achieve an even higher naive likelihood for θ = π/2 and σ → 0, representing a manifold that is orthogonal to the true one and projects all data points to a region of extremely high density on the manifold. In this limit the likelihood is in fact unbounded from above. Clearly, maximizing the naive likelihood alone is not very good at incentivizing the model to learn the correct manifold.
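The pathology is easy to reproduce numerically. The sketch below is our own construction mirroring the toy setup: a line manifold at angle θ, a Gaussian N(0, σ) along it, and the "naive" likelihood evaluated after projecting each data point onto the manifold.

```python
import numpy as np

# Data on the x-axis (the true manifold, theta = 0) drawn from N(0, 1).
rng = np.random.default_rng(4)
data = np.stack([rng.normal(size=1000), np.zeros(1000)], axis=1)

def naive_log_lik(theta, sigma):
    direction = np.array([np.cos(theta), np.sin(theta)])
    u = data @ direction                       # projection onto the model manifold
    return np.mean(-0.5 * (u / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

print(naive_log_lik(0.0, 1.0))                 # correct manifold and density
print(naive_log_lik(np.pi / 2, 1e-3))          # orthogonal manifold, tiny sigma: higher!
```

The orthogonal configuration projects every sample to u = 0, so shrinking σ drives the naive likelihood arbitrarily high, exactly the unbounded pathological minimum described above.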
To address this, we can add a second training objective that is responsible for learning the manifold. A suitable candidate is the reconstruction error discussed in the previous section. The top right panel in Figure 3(b) shows the mean reconstruction error as a function of the model parameters, which is indeed minimal for the true configuration.
One way to combine the two metrics is training on a combined loss that sums the reconstruction error and the negative naive log likelihood, with hyperparameters weighting the two terms. This helps, but does not really solve the problem. In our toy example we show such a combined loss in the bottom left of Figure 3(b). While the correct configuration is a local minimum of this loss, the wrong minimum at θ = π/2 and σ → 0 still exists and leads to a lower loss (which is unbounded from below). In general the correct solution might not even be a local minimum of such a combined loss function. When training the model parameters by minimizing this combined loss, the gradient flow may take the model to the correct solution or to a pathological configuration, depending on the initialization and the choice of hyperparameters.
A better strategy is to separate the model parameters that define the manifold from those that only describe the density on it. In the MFMF setup of the previous section, the parameters of the transformation f make up the first class, while the parameters of h (or p_u) are in the second; in the toy example in Figure 4, θ fixes the manifold and σ the density on it. We can then update the manifold parameters based only on the reconstruction error and update the density weights based only on the log likelihood. In Figure 3(b) this corresponds to horizontal steps in the top right panel and vertical steps in the bottom right panel, where we show the log likelihood normalized to the maximum likelihood estimator (MLE) for each value of θ. Such a training procedure is not prone to the gradient flow leading the model to a pathological configuration. In the limit of infinite capacity, sufficient training data, and successful optimization, it will correctly learn both the manifold and the density on it.
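On the same toy problem, the separated update scheme can be sketched in a few lines. Grid search and a closed-form Gaussian MLE stand in for the gradient-based manifold and density phases of the actual algorithm; this is an illustration of the two-phase logic, not the paper's training code.

```python
import numpy as np

# Toy data: x-axis manifold, N(0, 1) along it.
rng = np.random.default_rng(5)
data = np.stack([rng.normal(size=2000), np.zeros(2000)], axis=1)

def recon_error(theta):
    normal = np.array([-np.sin(theta), np.cos(theta)])   # normal of the line
    return np.mean((data @ normal) ** 2)

# Manifold phase: choose theta by reconstruction error alone
# (grid search stands in for gradient descent).
thetas = np.linspace(0.0, np.pi, 1801)
theta_hat = thetas[np.argmin([recon_error(t) for t in thetas])]

# Density phase: MLE for a centered 1D Gaussian on the projected coordinates.
u = data @ np.array([np.cos(theta_hat), np.sin(theta_hat)])
sigma_hat = float(np.sqrt(np.mean(u ** 2)))
print(theta_hat, sigma_hat)
```

Because the manifold phase never sees the likelihood, the pathological high-likelihood configuration at θ = π/2 is simply not reachable.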
The second challenge is the computational efficiency of evaluating the MFMF density in (14). While this quantity is tractable, it cannot be computed as cheaply as the ambient flow density of (2). The underlying reason is that since the Jacobian J_g is not square, it is not obvious how the determinant can be decomposed further. In particular, when the map g consists of multiple functions as given in (12), the Jacobian is given by a product of individual Jacobians, J_g = (∏_i J_{f_i}) J_Pad. While the J_{f_i} are invertible d × d matrices, the Jacobian J_Pad that represents the zero padding is a rectangular d × n matrix consisting of an n × n identity matrix padded with zeros, which leaves us with the following determinant to calculate:

det[ J_Pad^T (∏_i J_{f_i})^T (∏_i J_{f_i}) J_Pad ].   (19)
This determinant can be computed explicitly. However, when we compose an MFMF out of invertible transformations that have been designed for standard flows—coupling layers with invertible elementwise transformations, autoregressive transformations, permutations, or invertible linear transformations—evaluating this MFMF density requires the computation of all entries of the Jacobians of the individual transformations. This is a much larger computational effort than in the case of standard flows, where the overall log determinant can be split into a sum over the log determinants of each layer, which in turn can usually be written down as a single number without having to compute all elements of a Jacobian first.
While the computational cost of evaluating (19) is often reasonable for the evaluation of a limited number of test samples, it can be prohibitively expensive during training, which typically requires many more evaluations. Since the computational cost grows with increasing data dimensionality d, training by maximizing log p_ℳ(x) does not scale to high-dimensional problems.
Fortunately, gradient updates do not always require computing the full likelihood of the model. In particular, consider the training procedure introduced in the previous section, where we update the parameters of f by minimizing the reconstruction error and update the parameters of h (and thus of p_u) by maximizing the log likelihood. The manifold update phase does not require computing the log likelihood at all. For the density update, the loss functional is given by

L_density = −(1/N) ∑_i [ log p_ũ(h^{-1}(u_i)) − log |det J_h(h^{-1}(u_i))| − (1/2) log det[J_g^T(u_i) J_g(u_i)] ],   (20)

with u_i = g^{-1}(x_i). However, the last term (which is slow to evaluate) does not depend on the parameters of h and does not contribute to the gradient updates in this phase! We can therefore just as well train the parameters of h by minimizing only the first two terms, which can be evaluated very efficiently.
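A small numerical check of this shortcut, under toy assumptions of ours (a fixed linear Jacobian for g and a Gaussian coordinate density with scale σ as the only density parameter): the full negative log likelihood and the truncated loss that drops the expensive (1/2) log det(J_gᵀ J_g) term differ by a constant in σ, so they yield identical gradients for the density model.

```python
import numpy as np

# Fixed manifold parameters: a frozen d x n Jacobian of g and a batch
# of projected latent coordinates u. The density model is a centered
# Gaussian with scale sigma.
rng = np.random.default_rng(6)
d, n = 4, 2
J = rng.normal(size=(d, n))                     # fixed Jacobian of g
u = rng.normal(size=(100, n))                   # projected coordinates

def neg_log_pu(sigma):
    return -np.mean(-0.5 * np.sum((u / sigma) ** 2, axis=1)
                    - n * np.log(sigma * np.sqrt(2 * np.pi)))

def full_loss(sigma):
    # includes the slow 1/2 log det(J^T J) term from Eq. (20)
    return neg_log_pu(sigma) + 0.5 * np.linalg.slogdet(J.T @ J)[1]

def fast_loss(sigma):
    # drops the term that is constant in the density parameters
    return neg_log_pu(sigma)

print(full_loss(1.0) - fast_loss(1.0), full_loss(0.5) - fast_loss(0.5))
```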
For completeness we include the simultaneous optimization of the parameters of the manifold-defining transformation f and the parameters of the density-defining transformation h on a combined loss summing the negative naive log likelihood and the reconstruction error,

L = (1/N) ∑_i [ −log p_ℳ(x_i) + λ ‖x_i − g(g^{-1}(x_i))‖² ],   (21)

where λ is a hyperparameter.
Following the discussion in Section 3.1, we do not expect this algorithm to perform very well. First, as demonstrated in the toy example in Figure 4 there is a risk of pathological models with poor manifold quality and poor density estimation for which this loss is very small, potentially even lower than for the true model. Which configuration the model ends up in may critically depend on the initialization and the learning dynamics. Second, evaluating this loss can be computationally expensive, especially for high-dimensional problems. Nevertheless, we include this in our experiments on low-dimensional data for comparison.
To ameliorate the potential instability of this training objective and to speed up the training, we add a pre-training and a post-training phase. In the pre-training, the model is trained by minimizing the reconstruction error only, hopefully pushing the weights of f into the basin of attraction around the true model configuration before the main training phase begins. In the post-training phase, the parameters of f are fixed and only the parameters of h are updated by minimizing the relevant terms in the loss.
As discussed above, we expect both faster and more robust training when separating manifold and density updates, splitting the training into two phases:
Update only the parameters of the manifold-defining transformation f (and thus also of the chart g and the manifold M, which is defined as a level set of f) by minimizing the reconstruction error from the projection onto the manifold,
L_M = \frac{1}{N} \sum_i \left\| x_i - g\!\left(g^{-1}(x_i)\right) \right\|^2 \qquad (22)
with batch size N. For the MFMFE model, the parameters of the encoder are also updated during this phase.
Update only the parameters of the density-defining transformation h (which define the coordinate density p_u(u)) by minimizing the negative log likelihood
L_D = -\frac{1}{N} \sum_i \log p_u(u_i) \qquad (23)
An important choice is how these two phases are scheduled. The most straightforward strategy is sequential training, in which the manifold-defining transformation f is learned first, followed by the density-defining transformation h. We also experiment with an alternating scheme, where we switch between the two phases after a fixed number of gradient updates. The algorithm is described in more detail in Algorithm 1.
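The sequential M/D split can be sketched in a heavily simplified toy setting. Here the manifold-defining transformation is reduced to a single angle parameterizing a line through the origin in 2D, the coordinate density is a Gaussian fitted in closed form, and a finite-difference gradient stands in for backpropagation; all of this is illustrative and not the architecture used in the paper:

```python
import numpy as np

# Toy data on a 1D manifold (a line through the origin) embedded in 2D,
# with a small amount of off-manifold noise.
rng = np.random.default_rng(0)
angle_true = 0.7
w_true = np.array([np.cos(angle_true), np.sin(angle_true)])
t = rng.normal(1.0, 0.5, size=500)                        # manifold coordinates
X = t[:, None] * w_true + rng.normal(0, 0.01, (500, 2))   # slightly noisy 2D data


def reconstruction_error(phi, X):
    """Mean squared distance between x and its projection onto the model manifold."""
    w = np.array([np.cos(phi), np.sin(phi)])
    u = X @ w                                  # coordinates of the projections
    return np.mean(np.sum((X - u[:, None] * w) ** 2, axis=1))


# --- Manifold phase: update only the manifold parameter phi by minimizing
# the reconstruction error (finite-difference gradient descent).
phi = 0.0
for _ in range(200):
    eps = 1e-5
    grad = (reconstruction_error(phi + eps, X)
            - reconstruction_error(phi - eps, X)) / (2 * eps)
    phi -= 0.5 * grad

# --- Density phase: with the manifold frozen, fit the coordinate density
# (a Gaussian here, so maximum likelihood is available in closed form).
w = np.array([np.cos(phi), np.sin(phi)])
u = X @ w
mu, sigma = u.mean(), u.std()
```

The manifold phase never touches the density parameters, and the density phase sees only the coordinates of the projected data, mirroring the separation in Algorithm 1.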
Another option is to train manifold-modeling flows adversarially, similar to GANs or Flow-GANs (20). The loss function is then a distance metric between samples generated from the MFMF model and a batch of training samples. Such a distance metric can for instance be based on the output of a discriminator that is trained simultaneously, or on an integral probability metric such as the Wasserstein distance. We use unbiased Sinkhorn divergences, a tractable but positive definite approximation of Wasserstein divergences (21). In this training scheme, which we label OT, we iterate over the data in mini-batches, generate equally sized batches of samples from the manifold-modeling flow, and update the gradients based on the loss
L_{\mathrm{OT}} = S_\epsilon\!\left(\{x_i\},\, \{\tilde{x}_i\}\right) \qquad (24)
Here the Sinkhorn divergence is defined as
S_\epsilon(\alpha, \beta) = \mathrm{OT}_\epsilon(\alpha, \beta) - \frac{1}{2}\,\mathrm{OT}_\epsilon(\alpha, \alpha) - \frac{1}{2}\,\mathrm{OT}_\epsilon(\beta, \beta) \qquad (25)
with entropy-regularized optimal transport loss \mathrm{OT}_\epsilon. The Sinkhorn divergence interpolates between the Wasserstein distance (for \epsilon \to 0) and the energy distance (for \epsilon \to \infty). See Reference (21) for a detailed explanation.
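A minimal stand-in for the Sinkhorn divergence in (25) can be written with plain log-domain Sinkhorn iterations; this sketch assumes uniform weights on the two point clouds and a squared-distance cost, and is a simplified substitute for the GeomLoss implementation used in the paper:

```python
import numpy as np


def _logsumexp(M, axis):
    mx = M.max(axis=axis, keepdims=True)
    return np.squeeze(mx, axis) + np.log(np.exp(M - mx).sum(axis=axis))


def ot_eps(x, y, eps=0.1, iters=300):
    """Entropy-regularized OT cost between two uniform point clouds."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
    log_a = np.full(len(x), -np.log(len(x)))             # uniform weights
    log_b = np.full(len(y), -np.log(len(y)))
    f = np.zeros(len(x))
    g = np.zeros(len(y))
    for _ in range(iters):                               # log-domain Sinkhorn updates
        f = -eps * _logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    # transport plan P_ij = a_i b_j exp((f_i + g_j - C_ij) / eps)
    log_P = log_a[:, None] + log_b[None, :] + (f[:, None] + g[None, :] - C) / eps
    return (np.exp(log_P) * C).sum()


def sinkhorn_divergence(x, y, eps=0.1):
    # S_eps(x, y) = OT_eps(x, y) - OT_eps(x, x)/2 - OT_eps(y, y)/2, as in (25)
    return ot_eps(x, y, eps) - 0.5 * ot_eps(x, x, eps) - 0.5 * ot_eps(y, y, eps)
```

By construction the divergence vanishes for identical clouds, while separated clouds incur a sizable positive value.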
We can combine this adversarial training with likelihood-based phases for the base density into an alternating algorithm. It is essentially the same as the M/D algorithm described in Algorithm 1, except that in the first phase we draw samples from the model as well and optimize the parameters of both f and h by minimizing the loss in (24).
Given a set of data points, it is possible to train a neural network F that maps the data space to the real line in order to learn a signed distance function from the data manifold. The level set F(x) = 0 then corresponds to the manifold. Reference (22) proposes to achieve this goal by minimizing
L_{\mathrm{SDF}} = \frac{1}{N} \sum_i \left| F(x_i) \right| + \lambda\, \mathbb{E}_x\!\left[ \left( \left\| \nabla_x F(x) \right\| - 1 \right)^2 \right] \qquad (26)
combining a term that favors the network being zero on the data with an “Eikonal” term that encourages the gradients to be of unit norm everywhere, weighted by a hyperparameter λ. The expectation is taken with respect to some probability distribution over the data space. This ansatz can be applied to manifold-modeling flows. One approach would be to add a term like (26) for each component of the off-the-manifold latent variables to the existing loss functions,
L_{\mathrm{reg}} = \sum_k \mathbb{E}_x\!\left[ \left( \left\| \nabla_x v_k(x) \right\| - 1 \right)^2 \right] \qquad (27)
Computing this regularization term then requires evaluating the Jacobian of the latent variables with respect to the data, which is plagued by the same computational inefficiency that we discussed before. Nevertheless, the authors of Reference (22) report learned manifolds of a very high quality even for few training samples, and the computational expense may well be worth it. We leave an exploration of this idea for future work.
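The Eikonal term can be illustrated with a Monte-Carlo estimate based on finite-difference gradients; the functions and sampling distribution below are toy stand-ins for the network components:

```python
import numpy as np


def eikonal_penalty(F, points, h=1e-4):
    """Monte-Carlo estimate of E[(||grad F(x)|| - 1)^2] via central finite differences."""
    grads = np.zeros_like(points)
    for k in range(points.shape[1]):
        step = np.zeros(points.shape[1])
        step[k] = h
        grads[:, k] = (F(points + step) - F(points - step)) / (2 * h)
    return np.mean((np.linalg.norm(grads, axis=1) - 1.0) ** 2)


rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 2)) + np.array([2.0, 0.0])  # keep points away from the origin

sdf = lambda x: np.linalg.norm(x, axis=-1) - 1.0      # true signed distance to the unit circle
not_sdf = lambda x: np.sum(x ** 2, axis=-1) - 1.0     # same zero level set, non-unit gradients
```

For the true signed distance to the unit circle the penalty vanishes, while a function with the same zero level set but non-unit gradients is heavily penalized.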
Above we discussed training strategies that avoid computing the expensive terms in the likelihood. Even with such an efficient training scheme, the model likelihood often needs to be evaluated at test time, although typically not quite as often. Here we collect ideas for improving the efficiency of the likelihood evaluation.
While the model likelihood in (14) is tractable, evaluating it for typical flow transformations can be somewhat slow. The cost of this evaluation increases with the dimension of the feature space as well as with the complexity of the network architecture. In our experiments we found that this cost is not the limiting factor for low- to medium-dimensional feature spaces, even in the context of inference problems that require many repeated evaluations of the likelihood. In this work we thus restricted ourselves to exact likelihood evaluations and did not study the methods described in the following further.
The likelihood in (14) can be computed approximately, for instance with the methods proposed in References (17, 23, 24). Instead of computing the full Jacobian matrix, these methods just require calculating a number of matrix-vector products with randomly sampled vectors, which can be cheaper. Whether the gains in speed are worth the loss in precision from the approximation remains to be seen; we leave a test of this idea for future work.
The cost of evaluating the Jacobian determinant in (14) can be amortized by evaluating this Jacobian factor for a number of representative data points first, and then regressing a fast surrogate model on the map from data points to this factor. Afterwards, the MFMF likelihood can be evaluated at any point efficiently. We leave an investigation of this idea for future work.
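As an illustration of this amortization, consider a hypothetical one-dimensional chart g(u) = (u, u²), for which the Jacobian factor −½ log det(JᵀJ) = −½ log(1 + 4u²) is known in closed form; a low-degree polynomial fitted to a few hundred precomputed evaluations serves as the cheap surrogate:

```python
import numpy as np

# Chart g(u) = (u, u^2): a 1D manifold (a parabola) in R^2. Its Jacobian is
# J = (1, 2u)^T, so the likelihood correction is -1/2 log det(J^T J) = -1/2 log(1 + 4u^2).
def log_det_factor(u):
    return -0.5 * np.log(1.0 + 4.0 * u ** 2)


# Amortization: evaluate the expensive factor on representative points once...
u_train = np.linspace(-1.0, 1.0, 200)
coeffs = np.polyfit(u_train, log_det_factor(u_train), deg=12)

# ...then evaluate the cheap polynomial surrogate anywhere on the manifold.
surrogate = lambda u: np.polyval(coeffs, u)
```

The choice of a polynomial regressor is ours; any fast function approximator would serve the same purpose.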
The characterization of the evaluation of the Jacobian in (14) as computationally expensive depends on the architecture of the transformation f. In this work we only consider zero-padding followed by typical diffeomorphic transformations like coupling layers with invertible elementwise transformations or permutations; these transformations have evolved over many years of research with the design goal of efficient standard flow densities in mind. It may well be that a similar amount of research will unveil a class of transformations for which the terms in (14) can be computed efficiently without limiting their expressiveness. We hope that this paper can instigate research into such transformations.
We will now demonstrate manifold-modeling flows in two pedagogical examples. We plan to follow up with experiments focused on more realistic use cases.
A common metric for flow-based models is the model log likelihood evaluated on a test set, but such a comparison is not meaningful in our context. Since the MFMF variants evaluate the likelihood after projecting to the learned manifolds, the data space is different for every model and the likelihoods of different models may not even have the same units. Instead, we analyze the performance through the generative mode, evaluating the quality of samples generated from the models with different metrics depending on the data set. In addition, we use the model likelihood for inference tasks and gauge the quality of the resulting posterior.
First, we want to illustrate the different flow models in a simple toy example. Data is generated on the unit circle in two-dimensional space, where the usual polar angle is drawn from a Gaussian density. To represent a slightly noisy true manifold, the radial coordinate is not set exactly to 1, but drawn from a narrow Gaussian density centered at 1. We generate the training data set in this way.
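A sketch of this data-generating process (the means and standard deviations below are illustrative placeholders, not the exact values used in the experiment):

```python
import numpy as np

# Circle toy data: Gaussian polar angle, slightly noisy radius around 1.
# All distribution parameters here are assumptions for illustration.
rng = np.random.default_rng(0)
n = 1000
phi = rng.normal(np.pi / 2, 0.5, size=n)   # polar angle drawn from a Gaussian
r = rng.normal(1.0, 0.01, size=n)          # slightly noisy radius around 1
data = np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)
```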
To highlight the difference between the different models, we purposefully limit the expressivity of the flows by using simple affine coupling layers interspersed with random permutations of the latent variables. For the ambient flow we use ten affine coupling layers, while for PIE and the MFMF variants we restrict to five such layers and model the coordinate density p_u(u) with a Gaussian with learnable mean and variance. We also consider a FOM model, using the known parameterization of the unit circle to model the manifold and a Gaussian with learnable mean and variance for the density on it. Finally, for demonstration purposes we also include an MFMF model that is only trained on reconstruction error, essentially an invertible autoencoder, and label it MFMF–AE. In all cases, we limit the training to 120 epochs.
Figure 5 shows the true density of the data-generating process (top left) as well as the learned densities from different models (other panels). The standard flow (AF, top middle) learns a smeared-out version of the true density, with a substantial amount of probability mass away from the true manifold. Note that the AF results become much sharper when we train until convergence or switch to a state-of-the-art architecture, as we have tested with rational-quadratic neural spline flows (25). The PIE model (top right) also learns a smeared-out version, but its inductive bias leads to a sharper version than the AF model. We also show the manifold represented by the corresponding level set in the PIE model as a dotted black line; it is not in particularly good agreement with the true manifold.
In the bottom panels we show some algorithms with a model density restricted to the manifold; the black space in the figures thus shows the off-the-manifold regions, which are outside the support of the models. Note that this different support also means that the likelihood values between the top and bottom panels cannot be directly compared. The FOM model (bottom left), which requires knowledge of the manifold, perfectly captures both the shape of the manifold and the density on it. Our new MFMF–M/D algorithm (bottom middle) also parameterizes the density only on the manifold, but now the manifold is learned from data; we see both good manifold quality and good density estimation in the upper half of the circle, where most of the training data lie. In the lower part, where the density was too small to sample enough training data, the learned manifold departs from the true one. Finally, in the bottom right panel we show that training an MFMF model just on reconstruction error can lead to a good approximation of the manifold (where there is training data), but, of course, does not produce a reasonable density on this manifold.
Next, we consider a two-dimensional manifold embedded in three-dimensional space, defined by
x = R \begin{pmatrix} u_0 \\ u_1 \\ \sum_{i+j \le p} a_{ij}\, u_0^{\,i}\, u_1^{\,j} \end{pmatrix} \qquad (28)
Here R is a three-dimensional rotation matrix, u = (u_0, u_1) is a vector of two latent variables that parameterize the manifold, the a_{ij} are the coefficients of a polynomial, and p is the maximal power in the series. We fix R and the a_{ij} for these experiments by a single random sampling from the Haar measure and normal distributions, respectively; the values of these parameters are given in the appendix.
We define a conditional probability density p(u | θ) on the latent variables as
p(u \mid \theta) = w_1\, \mathcal{N}\!\left(u \mid 0, \Sigma_1\right) + w_2\, \mathcal{N}\!\left(u \mid 0, \Sigma_2(\theta)\right), \qquad w_1 > w_2, \qquad (29)
which together with the chart in (28) defines a probability density on the manifold. The dominant component of this mixture model is a normal distribution with a large covariance that is independent of the parameter θ, while the covariance of the smaller component depends on θ, which is restricted to a bounded range.
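A sketch of a generator for this kind of data set follows; the rotation, polynomial coefficients, mixture weights, and the specific dependence of the covariance on θ below are illustrative assumptions rather than the values used in the experiments:

```python
import numpy as np

# Illustrative data generator for a polynomial surface in 3D, as in (28).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q                 # proper 3D rotation
powers = [(i, j) for i in range(4) for j in range(4) if i + j <= 3]
a = rng.normal(size=len(powers))                      # polynomial coefficients


def chart(u):
    """Map 2D manifold coordinates to 3D ambient points, as in (28)."""
    z = sum(c * u[:, 0] ** i * u[:, 1] ** j for c, (i, j) in zip(a, powers))
    return np.stack([u[:, 0], u[:, 1], z], axis=1) @ R.T


def sample_latent(n, theta):
    # two-component Gaussian mixture: a dominant broad component independent
    # of theta, and a smaller component whose scale depends on theta (assumed form)
    comp = rng.random(n) < 0.6
    broad = rng.normal(0.0, 2.0, size=(n, 2))
    narrow = rng.normal(0.0, 0.3 + 0.2 * theta, size=(n, 2))
    return np.where(comp[:, None], broad, narrow)


x = chart(sample_latent(500, theta=0.5))
```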
We train several manifold-modeling flow variants on the training samples and compare to AF and PIE baselines. In all cases we use rational-quadratic neural spline flows with ten coupling layers interspersed with random permutations of the features. The setup is described in detail in the appendix.
We visualize the true data manifold and the estimated manifolds from a few MFMF and PIE models in Figure 6. In the top panels we compare the ground truth and three trained models at a fixed parameter point; in the bottom panels we show how the ground truth and the MFMF–M/D model change as the parameter is varied. The manifold defined by the corresponding level set in the PIE model is clearly not a good approximation of the true manifold: these directions are only partially aligned with the true data manifold, and the surface defined in this way does not extend near a large part of the true data manifold at all. MFMF–OT gets some of the features of the manifold and density right, but does not perform very well in regions of low density. The results that most closely resemble the true model come from the MFMF–M/D model: not only are the learned manifold and the density on the manifold very similar to the ground truth, but the model also accurately captures the dependency of the likelihood on the model parameter θ.
Model–algorithm | Mean distance from manifold | Mean reconstruction error | Posterior MMD | Out-of-distribution AUC |
---|---|---|---|---
AF | 0.005 | – | 0.071 | 0.990 |
PIE | 0.006 | 1.253 | 0.075 | 0.972 |
MFMF–S | 0.006 | 0.011 | 0.026 | 0.974 |
MFMF–M/D (alternating) | 0.002 | 0.003 | 0.020 | 0.986 |
MFMF–M/D (sequential) | 0.009 | 0.013 | 0.017 | 0.961 |
MFMF–OT | 0.089 | 0.433 | 0.134 | 0.647 |
MFMF–OT/D (alternating) | 0.142 | 1.121 | 0.051 | 0.584 |
MFMFE–S | 0.005 | 0.006 | 0.033 | 0.975 |
MFMFE–M/D (alternating) | 0.003 | 0.003 | 0.030 | 0.985 |
MFMFE–M/D (sequential) | 0.002 | 0.002 | 0.007 | 0.987 |
In Table 2 we evaluate the performance of the models on four metrics:
We compare the quality of samples generated from the flows by calculating the mean distance from the true data manifold using (28), as described in the appendix.
For all models except the AF we calculate the mean reconstruction error when projecting test samples onto the learned manifold.
We use the flow models for approximate inference on the parameter θ. We generate posterior samples with an MCMC sampler, using the likelihood of the different flow models in lieu of the true simulator density. The results are compared to posterior samples based on the true simulator likelihood. We summarize the similarity with the maximum mean discrepancy (MMD) of the posterior samples based on a Gaussian kernel (26).
Finally, we evaluate out-of-distribution (OOD) detection. For each model, we compare the distribution of log likelihood and reconstruction error between a normal test sample based on (29) and an OOD sample. The latter is based on the same density as the original model plus Gaussian noise with zero mean and standard deviation of 0.1 on all three features, pushing it off the data manifold of the regular training and test samples. We report the area under the curve (AUC), giving the larger of the two numbers when discrimination based on both the model likelihood and the reconstruction error is available.
For each metric, we report the median based on five runs with independent training samples and weight initializations.
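The MMD metric used above can be sketched as follows, assuming a Gaussian kernel and the standard biased V-statistic estimator; the kernel bandwidth and sample sizes are illustrative choices:

```python
import numpy as np


def gaussian_mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    # ||mean embedding of x - mean embedding of y||^2 in the kernel's RKHS
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

The biased estimator is nonnegative by construction, so larger values indicate a larger discrepancy between the two posterior sample sets.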
In all metrics except the out-of-distribution detection, manifold-modeling flows provide the best results. In particular, samples generated from the MFMF–M/D and MFMFE–M/D models are closest to the true data manifold, and these models most faithfully reconstruct test samples after projecting them to the learned manifold. These algorithms, which perform comparably within the variance between the runs, clearly outperform the AF and PIE baselines when it comes to inference on θ. The reconstruction error returned by these manifold-modeling flows is not quite as good at out-of-distribution detection as the AF log likelihood. The other training algorithms all have their shortcomings: MFMF–S training is not only slower, but also leads to slightly worse results, and the optimal transport variants MFMF–OT and MFMF–OT/D do not perform well on any metric, perhaps signaling the need for a more thorough tuning of hyperparameters.
Our work is closely related to a number of different probabilistic and generative models. We have discussed the relation to normalizing flows, autoencoders, variational autoencoders, generative adversarial networks, and energy-based models in the introduction and in Section 2. In addition, manifold learning is its own research field with a rich set of methods (27), though these typically do not model the data density on the manifold and thus do not serve quite the same purpose as the models discussed in this paper. In the following we want to draw attention to three particularly closely related works and describe how our approach differs from them.
The work most closely related to manifold-modeling flows are relaxed injective probability flows (8), which appeared while this paper was in its final stages of preparation. The proposed model is similar to our manifold-modeling flows with a separate encoder (MFMFE). A key difference is the way in which the invertibility of the decoder is enforced. The authors of Reference (8) bound the norm of the Jacobian of an otherwise unrestricted transformation from the latent space to the data space. While this makes the transformation in principle invertible (up to the possibility of multiple points in latent space mapping to the same point in data space), its inverse and the likelihood of this model are not tractable for unseen data points. This makes their algorithm unsuitable for inference tasks. As the authors point out, their model also cannot deal with points off the learned manifold. We address these issues by drawing from the flow literature and defining the decoder as the level set of a diffeomorphism, which is by construction exactly invertible. We also add a prescription for evaluating off-the-manifold points with a projection to the manifold, which naturally provides a measure of distance between the data point and the manifold.
Similar to our discussion in Section 3.1, the authors of Reference (8) also argue that training an injective flow by maximum likelihood is infeasible due to the computational cost of evaluating the Jacobian of the transformation. They propose a different training objective that is based on a stochastic approximation of a lower bound of the likelihood, which can be computed efficiently. We point to this training strategy in our discussion in Section 3.2, but found that the alternating procedure allows us to sidestep the problem. Finally, their motivation is different from ours: while we develop MFMFs specifically to better represent the true structure of the data, they focus on the reduced computational complexity of the model due to a lower-dimensional latent space; they view the lack of support of the model off the manifold as a deficiency rather than an advantage. In addition to these qualitative differences, it would be interesting to compare relaxed injective probability flows and manifold-modeling flows quantitatively.
Another closely related model is the pseudo-invertible autoencoder (PIE) (19), which we define and discuss in Section 2 and use as a baseline in our experiments. The key difference to our MFMF setup is that the PIE model describes a density over the ambient data space, while MFMF limits the density strictly to the manifold. In this sense the PIE approach is much more similar to a standard ambient flow, though it adds a multi-scale architecture and different base densities for the latent variables that correspond to the manifold coordinates and the off-the-manifold latents. In addition to this fundamental difference in construction, PIE and MFMF models are trained differently: for PIE maximum likelihood is sufficient, while for MFMF we discuss the shortcomings of that objective and propose several new training schemes.
Finally, the MFMF is closely related to normalizing flows on (prescribed) manifolds (6) (FOM). In particular, the likelihood equation is almost the same, with the crucial exception that manifold flows require knowing a parameterization of the manifold in terms of coordinates and a chart, while the MFMF algorithm learns these from data. Since in many real-world cases the manifold is not known, MFMF models are applicable to a much larger class of problems than FOM.
This paper contains four main contributions:
We propose manifold-modeling flows (MFMF and MFMFE).
We identify a subtlety in the naive interpretation of the density of such models and argue that they should not be trained by naive maximum likelihood alone. We address this issue with the new manifold / density (M/D) training strategy, which separates manifold and density updates. This both reduces the computational cost of the likelihood evaluation during training and avoids issues with potentially pathological configurations. We also discuss training strategies based on adversarial training and optimal transport.
We demonstrate these models and training algorithms in two pedagogical examples.
Beyond the newly proposed algorithms we provide a general discussion of the relation between different generative models and the data manifold, reviewing ambient flows, injective flows, flows on manifolds, PIEs, VAEs, and GANs in a common language. In particular, we identify an inconsistency between training and data generation for PIE models.
In this work we introduced manifold-modeling flows (MFMFs), a new type of generative model that combines aspects of normalizing flows, autoencoders, and GANs. MFMFs describe data as a probability density over a lower-dimensional manifold embedded in data space. Unlike flows on prescribed manifolds, they learn a chart for the manifold from the training data. MFMFs allow generating data in a similar way to GANs while maintaining a tractable exact density. They also provide a prescription for evaluating points off the manifold by first projecting data onto the manifold. The MFMF approach may not only represent data sets with manifold structure more accurately, but also allow us to use lower-dimensional latent spaces than with ambient flows, reducing the memory and computational footprint. As an added benefit, the projection to the manifold may be useful for denoising or to detect out-of-distribution samples. We introduced two variants of this new model, one of which features a separate encoder while the other uses the inverse of the decoder directly, and broadly reviewed the relation between several types of generative models and the structure of the data manifold.
Despite the tractable density, training MFMF models is nontrivial: any update of the manifold modifies the data variable that the density is describing, rendering training by naive maximum likelihood invalid. In addition, computing the full model likelihood can be expensive. We reviewed several training and evaluation strategies that mitigate this problem. In particular, we introduced the new M/D training schedule, which separates manifold and density updates and solves both stability and training issues. We also presented an adversarial training scheme based on optimal transport as well as a hybrid version that alternates between adversarial phases and density updates.
In two pedagogical experiments, we demonstrated how this approach lets us learn the data manifold and a probability density on it. MFMF models allowed us to reconstruct the data manifold with a substantially higher quality than PIE and outperformed ambient flow and PIE baselines on a downstream inference task. Our experiments were so far limited to problems with a low dimensionality. While training manifold-modeling flows on high-dimensional data sets such as high-resolution images is straightforward and the generative mode remains efficient, inference tasks will become more challenging as the exact likelihood evaluation becomes increasingly expensive. In this paper we have laid out multiple strategies that can help mitigate this cost.
Problems in which data populates a lower-dimensional manifold embedded in a high-dimensional feature space are almost everywhere. In some scientific cases, domain knowledge allows for exact statements about the dimensionality of the data manifold, and MFMFs can be a particularly powerful tool in a likelihood-free or simulation-based inference setting (28). Even in the absence of such domain-specific insight this approach may be valuable: GANs with low-dimensional latent spaces are powerful generative models for numerous data sets of natural images, which is testament to the presence of a low-dimensional data manifold. Flows that simultaneously learn the data manifold and a tractable density over it may help us to unify generative and inference tasks in a way that is well-suited to the structure of the data.
We would like to thank Jens Behrmann, Jean Feydy, Michael Kagan, George Papamakarios, Merle Reinhart, Frank Rösler, John Tamanas, and Andrew Wilson for useful discussions. We are grateful to Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios for publishing their excellent neural spline flow codebase (25), which we used extensively in our analysis. Similarly, we want to thank George Papamakarios, David Sterratt, and Iain Murray for publishing their Sequential Neural Likelihood code (29), parts of which were used in the evaluation steps in our experiments. We are grateful to the authors and maintainers of Delphes 3 (30), GeomLoss (21), Jupyter (31), MadMiner (32), Matplotlib (33), NumPy (34), Pythia8 (35), PyTorch (36), scikit-learn (37), and SciPy (38). This work was supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471; by the Moore-Sloan data science environment at NYU; and through the NYU IT High Performance Computing resources, services, and staff expertise.
In our second experiment, the manifold is defined by (28). We use the randomly drawn polynomial coefficients
(30)
and the rotation matrix
(31)
For the training data set we draw parameter points from a uniform prior, while for the test set we generate data at a fixed parameter point.
We implement all generative models as rational-quadratic neural spline flows with coupling layers alternating with random permutations (25). For standard flows we use ten coupling layers; for PIE, MFMF, and MFMFE models we use five layers for the manifold-defining transformation f (which also defines the manifold through a level set) and five layers for the density-defining transformation h. For the PIE model we use an off-the-manifold base density with a small standard deviation. In each coupling transform, half of the inputs are transformed elementwise with a monotonic rational-quadratic spline, the parameters of which are determined from a residual network with two residual blocks of two hidden layers each, 100 units in each layer, and the same activation function throughout. We do not use batch normalization or dropout, since we found that the stochasticity they induce can lead to issues with the invertibility of the transformations. The splines are constructed in ten bins per variable, distributed over a fixed range. All models are trained with the Adam optimizer, with an initial learning rate subject to cosine annealing and with weight decay. To balance the sizes of the various terms in the loss functions, we multiply them with different weights: for the manifold phase of the M/D training we weight the mean reconstruction error with a constant factor; in the S training the loss is a weighted sum of the mean negative log likelihood and the mean reconstruction error; and for OT training we weight the Sinkhorn divergence with a factor of 10. We train for 50 epochs with a batch size of 100 (1000 for the OT training). We study sequential as well as alternating versions of the M/D algorithm, where in the latter case we alternate between training phases after every epoch. We save the weights after each epoch and use the set of weights that leads to the smallest validation loss.
We evaluate generated samples by undoing the rotation and computing the distance to the manifold along the direction perpendicular to the latent coordinates. For the inference task we use a Metropolis–Hastings MCMC sampler based on the different flow likelihoods. We consider a synthetic “observed” data set of 10 i.i.d. samples generated at a fixed parameter point. For each model, we generate an MCMC chain of length 5000, with a Gaussian proposal distribution with mean step size 0.15 and a burn-in of 100 steps.
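A minimal Metropolis–Hastings sampler matching this setup (Gaussian random-walk proposal, chain length 5000, burn-in of 100 steps) can be sketched as follows; the target below is a toy stand-in for the flow likelihoods:

```python
import numpy as np

rng = np.random.default_rng(0)


def log_target(theta):
    # assumed toy posterior N(1, 0.3^2), standing in for a flow log likelihood
    return -0.5 * ((theta - 1.0) / 0.3) ** 2


def metropolis_hastings(log_p, theta0=0.0, steps=5000, step_size=0.15, burn_in=100):
    chain = []
    theta, lp = theta0, log_p(theta0)
    for _ in range(steps):
        prop = theta + step_size * rng.normal()          # Gaussian random-walk proposal
        lp_prop = log_p(prop)
        if np.log(rng.random()) < lp_prop - lp:          # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
        chain.append(theta)
    return np.array(chain[burn_in:])


samples = metropolis_hastings(log_target)
```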