Analyzing Inverse Problems with Invertible Neural Networks

08/14/2018 ∙ by Lynton Ardizzone, et al. ∙ University of Heidelberg 20

In many tasks, in particular in natural science, the goal is to determine hidden system parameters from a set of measurements. Often, the forward process from parameter- to measurement-space is a well-defined function, whereas the inverse problem is ambiguous: one measurement may map to multiple different sets of parameters. In this setting, the posterior parameter distribution, conditioned on an input measurement, has to be determined. We argue that a particular class of neural networks is well suited for this task -- so-called Invertible Neural Networks (INNs). Although INNs are not new, they have, so far, received little attention in literature. While classical neural networks attempt to solve the ambiguous inverse problem directly, INNs are able to learn it jointly with the well-defined forward process, using additional latent output variables to capture the information otherwise lost. Given a specific measurement and sampled latent variables, the inverse pass of the INN provides a full distribution over parameter space. We verify experimentally, on artificial data and real-world problems from astrophysics and medicine, that INNs are a powerful analysis tool to find multi-modalities in parameter space, to uncover parameter correlations, and to identify unrecoverable parameters.



There are no comments yet.


page 6

page 8

page 9

Code Repositories


Code for the paper "Analyzing inverse problems with invertible neural networks." (2018)

view repo


Normalizing-flow Invertible Neural Networks (TensorFlow2+Keras)

view repo


Probabilistic Graphical Model Construction

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

When analyzing complex physical systems, a common problem is that the system parameters of interest cannot be measured directly. For many of these systems, scientists have developed sophisticated theories on how measurable quantities arise from the hidden parameters . We will call such mappings the forward process. However, the inverse process is required to infer the hidden states of a system from measurements. Unfortunately, the inverse is often both intractable and ill-posed, since crucial information is lost in the forward process.

To fully assess the diversity of possible inverse solutions for a given measurement, an inverse solver should be able to estimate the

complete posterior of the parameters, conditioned on an observation. This makes it possible to quantify uncertainty, reveal multi-modal distributions, and identify degenerate and unrecoverable parameters – all highly relevant for applications in natural science. In this paper, we ask if invertible neural networks (INNs) are a suitable model class for this task. INNs are characterized by three properties:

  1. The mapping from inputs to outputs is bijective, i.e. its inverse exists,

  2. both forward and inverse mapping are efficiently computable, and

  3. both mappings have a tractable Jacobian, which allows explicit computation of posterior probabilities.

Networks that are invertible by construction offer a unique opportunity: We can train them on the well-understood forward process and get the inverse for free by running them backwards at prediction time. To counteract the inherent information loss of the forward process, we introduce additional latent output variables , which capture the information about that is not contained in . Thus, our INN learns to associate hidden parameter values with unique pairs of measurements and latent variables. Forward training optimizes the mapping and implicitly determines its inverse . Additionally, we make sure that the density

of the latent variables is shaped as a Gaussian distribution. Thus, the INN represents the desired posterior

by a deterministic function that transforms (“pushes”) the known distribution to -space, conditional on .

Compared to standard approaches (see Fig. 1, left), INNs circumvent a fundamental difficulty of learning inverse problems: Defining a sensible supervised loss for direct posterior learning is problematic since it requires prior knowledge about that posterior’s behavior, constituting a kind of hen-end-egg problem. If the loss does not match the possibly complicated (e.g. multi-modal) shape of the posterior, learning will converge to incorrect or misleading solutions. Since the forward process is usually much simpler and better understood, forward training diminishes this difficulty. Specifically, we make the following contributions:

  • [leftmargin=2em, rightmargin=1em]

  • We show that the full posterior of an inverse problem can be estimated with invertible networks, both theoretically in the asymptotic limit of zero loss, and practically on synthetic and real-world data from astrophysics and medicine.

  • The architectural restrictions imposed by invertibility do not seem to have detrimental effects on our network’s representational power.

  • While forward training is sufficient in the asymptotic limit, we find that a combination with unsupervised backward training improves results on finite training sets.

  • In our applications, our approach to learning the posterior compares favourably to approximate Bayesian computation (ABC) and conditional VAEs. This enables identifying unrecoverable parameters, parameter correlations and multimodalities.

Invertible Neural Network

Standard (Bayesian) Neural Network


forward (simulation):

inverse (sampling):




(Bayesian) NN

forward (simulation):

inverse (prediction):

Figure 1: Abstract comparison of standard approach (left) and ours (right). The standard direct approach requires a discriminative, supervised loss (SL) term between predicted and true , causing problems when is ambiguous. Our network uses a supervised loss only for the well-defined forward process . Generated are required to follow the prior by an unsupervised loss (USL), while the latent variables are made to follow a Gaussian distribution, also by an unsupervised loss. See details in Section 3.3.

2 Related work

Modeling the conditional posterior of an inverse process is a classical statistical task that can in principle be solved by Bayesian methods. Unfortunately, exact Bayesian treatment of real-world problems is usually intractable. The most common (but expensive) solution is to resort to sampling, typically by a variant of Markov Chain Monte Carlo

(Robert and Casella, 2004; Gamerman and Lopes, 2006). If a model for the forward process is available, approximate Bayesian computation (ABC) is often preferred, which embeds the forward model in a rejection sampling scheme for the posterior (Sunnåker et al., 2013; Lintusaari et al., 2017; Wilkinson, 2013).

Variational methods offer a more efficient alternative, approximating the posterior by an optimally chosen member of a tractable distribution family (Blei et al., 2017). Neural networks can be trained to predict accurate sufficient statistics for parametric posteriors (Papamakarios and Murray, 2016; Siddharth et al., 2017), or can be designed to learn a mean-field distribution for the network’s weights via dropout variational inference (Gal and Ghahramani, 2015; Kingma et al., 2015). Both ideas can be combined (Kendall and Gal, 2017) to differentiate between data-related and model-related uncertainty. However, the restriction to limited distribution families fails if the true distribution is too complex (e.g. when it requires multiple modes to represent ambiguous or degenerate solutions) and essentially counters the ability of neural networks to act as universal approximators. Conditional GANs (cGANs; Mirza and Osindero, 2014; Isola et al., 2017) overcome this restriction in principle, but often lack satisfactory diversity in practice (Zhu et al., 2017b)

. For our tasks, conditional variational autoencoders

(cVAEs; Sohn et al., 2015) perform better than cGANs, and are also conceptually closer to our approach (see appendix Sec. 2), and hence serve as a baseline in our experiments.

Generative modeling via learning of a non-linear transformation between the data distribution and a simple prior distribution

(Deco and Brauer, 1995; Hyvärinen and Pajunen, 1999) has the potential to solve these problems. Today, this approach is often formulated as a normalizing flow (Tabak et al., 2010; Tabak and Turner, 2013), which gradually transforms a normal density into the desired data density and relies on bijectivity to ensure the mapping’s validity. These ideas were applied to neural networks by Deco and Brauer (1995); Rippel and Adams (2013); Rezende and Mohamed (2015) and refined by Tomczak and Welling (2016); Berg et al. (2018); Trippe and Turner (2018). Today, the most common realizations use auto-regressive flows

, where the density is decomposed according to the Bayesian chain rule

(Kingma et al., 2016; Huang et al., 2018; Germain et al., 2015; Papamakarios et al., 2017; Oord et al., 2016; Kolesnikov and Lampert, 2017; Salimans et al., 2017; Uria et al., 2016). These networks successfully learned unconditional generative distributions for artificial data and standard image sets (e.g. MNIST, CelebA, LSUN bedrooms), and some encouraging results for conditional modeling exist as well (Oord et al., 2016; Salimans et al., 2017; Papamakarios et al., 2017; Uria et al., 2016).

These normalizing flows possess property (i) of an INN, and are usually designed to fulfill requirement (iii) as well. In other words, flow-based networks are invertible in principle, but the actual computation of their inverse is too costly to be practical, i.e. INN property (ii) is not fulfilled. This precludes the possibility of bi-directional or cyclic training, which has been shown to be very beneficial in generative adversarial nets and auto-encoders (Zhu et al., 2017a; Dumoulin et al., 2016; Donahue et al., 2017; Teng et al., 2018). In fact, optimization for cycle consistency forces such models to converge to invertible architectures, making fully invertible networks a natural choice. True INNs can be built using coupling layers, as introduced in the NICE (Dinh et al., 2014) and RealNVP (Dinh et al., 2016) architectures. Despite their simple design and training, these networks were rarely studied: Gomez et al. (2017) used a NICE-like design as a memory-efficient alternative to residual networks, Jacobsen et al. (2018) demonstrated that the lack of information reduction from input to representation does not cause overfitting, and Schirrmeister et al. (2018) trained such a network as an adverserial autoencoder. Danihelka et al. (2017) showed that minimization of an adversarial loss is superior to maximum likelihood training in RealNVPs, whereas the Flow-GAN of Grover et al. (2017) performs even better using bidirectional training, a combination of maximum likelihood and adverserial loss. The Glow architecture by Kingma and Dhariwal (2018) incorporates invertible 1x1 convolutions into RealNVPs to achieve impressive image manipulations. This line of research inspired us to extend RealNVPs for the task of computing posteriors in real-world inverse problems from natural and life sciences.

3 Methods

3.1 Problem specification

We consider a common scenario in natural and life sciences: Researchers are interested in a set of variables describing some phenomenon of interest, but only variables can actually be observed, for which the theory of the respective research field provides a model for the forward process. Since the transformation from to incurs an information loss, the intrinsic dimension of is in general smaller than , even if the nominal dimensions satisfy . Hence we want to express the inverse model as a conditional probability , but its mathematical derivation from the forward model is intractable in the applications we are going to address.

We aim at approximating by a tractable model , taking advantage of the possibility to create an arbitrary amount of training data from the known forward model and a suitable prior

. While this would allow for training of a standard regression model, we want to approximate the full posterior probability. To this end, we introduce a latent random variable

drawn from a multi-variate standard normal distribution and reparametrize

in terms of a deterministic function of and , represented by a neural network with parameters :


Note that we distinguish between hidden parameters representing unobservable real-world properties and latent variables carrying information intrinsic to our model. Choosing a Gaussian prior for

poses no additional limitation, as proven by the theory of non-linear independent component analysis

(Hyvärinen and Pajunen, 1999).

In contrast to standard methodology, we propose to learn the model of the inverse process jointly with a model approximating the known forward process :


Functions and share the same parameters and are implemented by a single invertible neural network. Our experiments show that joint bi-directional training of and avoids many complications arising in e.g. cVAEs or Bayesian neural networks, which have to learn the forward process implicitly.

The relation is enforced by the invertible network architecture, provided that the nominal and intrinsic dimensions of both sides match. When denotes the intrinsic dimension of , the latent variable must have dimension , assuming that the intrinsic dimension of equals its nominal dimension . If the resulting nominal output dimension exceeds

, we augment the input with a vector

of zeros and replace with the concatenation everywhere. Combining these definitions, our network expresses as


with Jacobian determinant . When using coupling layers, according to Dinh et al. (2016), computation of is simple, as each transformation has a triangular Jacobian matrix.

3.2 Invertible architecture

To create a fully invertible neural network, we follow the architecture proposed by Dinh et al. (2016): The basic unit of this network is a reversible block consisting of two complementary affine coupling layers. Hereby, the block’s input vector is split into two halves, and , which are transformed by an affine function with coefficients and (), using element-wise multiplication () and addition:


Given the output , these expressions are trivially invertible:


Importantly, the mappings and can be arbitrarily complicated functions of and and need not

themselves be invertible. In our implementation, they are realized by a succession of several fully connected layers with leaky ReLU activations.

A deep invertible network is composed of a sequence of these reversible blocks. To increase model capacity, we apply a few simple extensions to this basic architecture. Firstly, if the dimension

is small, but a complex transformation has to be learned, we find it advantageous to pad both the in- and output of the network with an equal number of zeros. This does not change the intrinsic dimensions of in- and output, but enables the network’s interior layers to embed the data into a larger representation space in a more flexible manner. Secondly, we insert permutation layers between reversible blocks, which shuffle the elements of the subsequent layer’s input in a randomized, but fixed, way. This causes the splits

to vary between layers and enhances interaction among the individual variables. Kingma and Dhariwal (2018) use a similar architecture with learned permutations.

3.3 Bi-directional training

Invertible networks offer the opportunity to simultaneously optimize for losses on both the in- and output domains (Grover et al., 2017), which allows for more effective training. Hereby, we perform forward and backward iterations in an alternating fashion, accumulating gradients from both directions before performing a parameter update. For the forward iteration, we penalize deviations between simulation outcomes and network predictions with a loss . Depending on the problem, can be any supervised loss, e.g. squared loss for regression or cross-entropy for classification.

The loss for latent variables penalizes the mismatch between the joint distribution of network outputs

and the product of marginal distributions of simulation outcomes and latents as .

We block the gradients of with respect to to ensure the resulting updates only affect the predictions of and do not worsen the predictions of . Thus, enforces two things: firstly, the generated must follow the desired normal distribution ; secondly, and must be independent upon convergence (i.e. ), and not encode the same information twice. As is implemented by Maximum Mean Discrepancy (Sec. 3.4), which only requires samples from the distributions to be compared, the Jacobian determinants and do not have to be known explicitly. In appendix Sec. 1, we prove the following theorem:

Theorem: If an INN is trained as proposed, and both the supervised loss and the unsupervised loss reach zero, sampling according to Eq. 1 with returns the true posterior for any measurement .

Although and are sufficient asymptotically, a small amount of residual dependency between and remains after a finite amount of training. This causes to deviate from the true posterior . To speed up convergence, we also define a loss on the input side, implemented again by MMD. It matches the distribution of backward predictions against the prior data distribution through . In the appendix, Sec. 1, we prove that is guaranteed to be zero when the forward losses and have converged to zero. Thus, incorporating does not alter the optimum, but improves convergence in practice.

Finally, if we use padding on either network side, loss terms are needed to ensure no information is encoded in the additional dimensions. We a) use a squared loss to keep those values close to zero and b) in an additional inverse training pass, overwrite the padding dimensions with noise of the same amplitude and minimize a reconstruction loss, which forces these dimensions to be ignored.

3.4 Maximum mean discrepancy

Maximum Mean Discrepancy (MMD) is a kernel-based method for comparison of two probability distributions that are only accessible through samples

(Gretton et al., 2012). While a trainable discriminator loss is often preferred for this task in high-dimensional problems, especially in GAN-based image generation, MMD also works well, is easier to use and much cheaper, and leads to more stable training (Tolstikhin et al., 2017)

. The method requires a kernel function as a design parameter, and we found that kernels with heavier tails than Gaussian are needed to get meaningful gradients for outliers. We achieved best results with the Inverse Multiquadratic

, reconfirming the suggestion from Tolstikhin et al. (2017). Since the magnitude of the MMD depends on the kernel choice, the relative weights of the losses , ,

are adjusted as hyperparameters, such that their effect is about equal.

4 Experiments

We first demonstrate the capabilities of INNs on two well-behaved synthetic problems and then show results for two real-world applications from the fields of medicine and astrophysics. Additional details on the datasets and network architectures are provided in the appendix.

4.1 Artificial data

Gaussian mixture model:

To test basic viability of INNs for inverse problems, we train them on a standard 8-component Gaussian mixture model

. The forward process is very simple: The first four mixture components (clockwise) are assigned label , the next two get label , and the final two are labeled and (Fig. 2). The true inverse posteriors

consist of the mixture components corresponding to the given one-hot-encoded label

. We train the INN to directly regress one-hot vectors using a squared loss , so that we can provide plain one-hot vectors to the inverse network when sampling . We observe the following: (i) The INN learns very accurate approximations of the posteriors and does not suffer from mode collapse. (ii) The coupling block architecture does not reduce the network’s representational power – results are similar to standard networks of comparable size (see appendix Sec. 2). (iii) Bidirectional training works best, whereas forward training alone (using only and ) captures the conditional relationships properly, but places too much mass in unpopulated regions of -space. Conversely, pure inverse training (just ) learns the correct -distribution, but loses all conditioning information.

Figure 2: Viability of INN for a basic inverse problem. The task is to produce the correct (multi-modal) distribution of 2D points , given only the color label . When trained with all loss terms from Sec. 3.3, the INN output matches ground truth almost exactly (2nd image). The ablations (3rd and 4th image) show that we need and to learn the conditioning correctly, whereas helps us remain faithful to the prior.

Inverse kinematics:

For a task with a more complex and continuous forward process, we simulate a simple inverse kinematics problem in 2D space: An articulated arm moves vertically along a rail and rotates at three joints. These four degrees of freedom constitute the parameters

. Their priors are given by a normal distribution, which favors a pose with angles and centered origin. The forward process is to calculate the coordinates of the end point , given a configuration . The inverse problem asks for the posterior distribution over all possible inputs that place the arm’s end point at a given position. An example for a fixed is shown in Fig. 3, where we compare our INN to a conditional VAE (see appendix Fig. 7 for conceptual comparison of architectures). Adding Inverse Autoregressive Flow (IAF, Kingma et al., 2016) does not improve cVAE performance in this case (see appendix, Table 2). The chosen in Fig. 3 is a hard example, as it is unlikely under the prior (Fig. 3, right) and has a strongly bi-modal posterior .

In this case, due to the computationally cheap forward process, we can use approximate Bayesian computation (ABC, see appendix Sec. 7) to sample from the ground truth posterior. Compared to ground truth, we find that both INN and cVAE recover the two symmetric modes well. However, the true end points of -samples produced by the cVAE tend to miss the target by a wider margin. This is because the forward process is only learned implicitly during cVAE training. See appendix for quantitative analysis and details.

Figure 3: Distribution over articulated poses , conditioned on the end point . The desired end point is marked by a gray cross. A dotted line on the left represents the rail the arm is based on, and the faint colored lines indicate sampled arm configurations taken from the true (ABC) or learned (INN, cVAE) posterior . The prior (right) is shown for reference. The actual end point of each sample may deviate slightly from the target ; contour lines enclose the regions containing 97% of these end points. We emphasize the articulated arm with the highest estimated likelihood for illustrative purposes.

4.2 Real-world applications

After demonstrating the viability on synthetic data, we apply our method to two real world problems from medicine and astronomy. While we focus on the medical task in the following, the astronomy application is shown in Fig. 5.

In medical science, the functional state of biological tissue is of interest for many applications. Tumors, for example, are expected to show changes in oxygen saturation (Hanahan and Weinberg, 2011). Such changes cannot be measured directly, but influence the reflectance of the tissue, which can be measured by multispectral cameras (Lu and Fei, 2014). Since ground truth data can not be obtained from living tissue, we create training data by simulating observed spectra from a tissue model involving , blood volume fraction , scattering magnitude , anisotropy and tissue layer thickness (Wirkert et al., 2016). This model constitutes the forward process, and traditional methods to learn point estimates of the inverse (Wirkert et al., 2016, 2017; Claridge and Hidovic-Rowe, 2013) are already sufficiently reliable to be used in clinical trials. However, these methods can not adequately express uncertainty and ambiguity, which may be vital for an accurate diagnosis.

Competitors.  We train an INN for this problem, along with two ablations (as in Fig. 2), as well as a cVAE with and without IAF (Kingma et al., 2016) and a network using the method of Kendall and Gal (2017), with dropout sampling and additional aleatoric error terms for each parameter. The latter also provides a point-estimate baseline (classical NN) when used without dropout and error terms, which matches the current state-of-the-art results in Wirkert et al. (2017). Finally, we compare to ABC, approximating with the 256 samples closest to . Note that with enough samples ABC would produce the true posterior. We performed simulations to generate samples for ABC at test time, taking one week on a GPU, but still measure inconsistencies in the posteriors. The learning-based methods are trained within minutes, on a training set of samples generated offline.

Error measures.  We are interested in both the accuracy (point estimates), and the shape of the posterior distributions. For point estimates , i.e. MAP estimates, we compute the deviation from ground-truth values in terms of the RMSE over test set observations , . The scores are reported both for the main parameter of interest , and the parameter subspace of , , , which we found to be the only recoverable parameters. Furthermore, we check the re-simulation error: We apply the simulation to the point estimate, and compare the simulation outcome to the conditioning . To evaluate the shape of the posteriors, we compute the calibration error for the sampling-based methods, based on the fraction of ground truth inliers for corresponding -confidence-region of the marginal posteriors of . The reported error is the median of over all . All values are computed over 5000 test-set observations , or 1000 observations in the case of re-simulation error. Each posterior uses 4096 samples, or 256 for ABC; all MAP estimates are found using the mean-shift algorithm.

Quantitative results.  Evaluation results for all methods are presented in Table 1. The INN matches or outperforms other methods in terms of point estimate error. Its accuracy deteriorates slightly when trained without , and entirely when trained without the conditioning losses and , just as in Fig. 2. For our purpose, the calibration error is the most important metric, as it summarizes the correctness of the whole posterior distribution in one number (see appendix Fig. 11). Here, the INN has a big lead over cVAE(-IAF) and Dropout, and even over ABC due to the low ABC sample count.

Method MAP error MAP error all MAP re-simulation error Calibration error
NN (+ Dropout) 0.057 0.003 0.56 0.01 0.397 0.008 1.91%
INN 0.041 0.002 0.57 0.02 0.327 0.007 0.34%
INN, only 0.066 0.003 0.71 0.02 0.506 0.010 1.62%
INN, only 0.861 0.033 1.70 0.02 2.281 0.045 3.20%
cVAE 0.050 0.002 0.74 0.02 0.314 0.007 2.19%
cVAE-IAF 0.050 0.002 0.74 0.03 0.313 0.008 1.40%
ABC 0.036 0.001 0.54 0.02 0.284 0.005 0.90%
Simulation noise 0.129 0.001
Table 1: Quantitative results in medical application. We measure the accuracy of point/MAP estimates as detailed in Sec. 4.2. Best results within measurement error are bold, and we determine uncertainties () by statistical bootstrapping. The parameter is the most relevant in this application, whereas error all means all recoverable parameters (, and ). Re-simulation error measures how well the MAP estimate is conditioned on the observation . Calibration error is the most important, as it summarizes correctness of the posterior shape in one number; see appendix Fig. 11 for more calibration results.
Figure 4: Sampled posterior of 5 parameters for fixed in medical application. For a fixed observation , we compare the estimated posteriors of different methods. The bottom row also includes the point estimate (dashed green line). Ground truth values (dashed black line) and prior over all data (gray area) are provided for reference.

Qualitative results.  Fig. 4 shows generated parameter distributions for one fixed measurement , comparing the INN to cVAE-IAF, Dropout sampling and ABC. The three former methods use a sample count of

to produce smooth curves. Due to the sparse posteriors of 256 samples in the case of ABC, kernel density estimation was applied to its results, with a bandwidth of

. The results produced by the INN provide relevant insights: First, we find that the posteriors for layer thickness and anisotropy match the shape of their priors, i.e.  holds no information about these parameters – they are unrecoverable. This finding is supported by the ABC results, whereas the other two methods misleadingly suggest a roughly Gaussian posterior. Second, we find that the sampled distributions for the blood volume fraction and scattering amplitude are strongly correlated (rightmost plot). This phenomenon is not an analysis artifact, but has a sound physical explanation: As blood volume fraction increases, more light is absorbed inside the tissue. For the sensor to record the same intensities as before, scattering must be increased accordingly. In Fig. 10 in the appendix, we show how the INN is applied to real multispectral images.

Figure 5: Astrophysics application. Properties of star clusters in interstellar gas clouds are inferred from multispectral measurements . We train an INN on simulated data, and show the sampled posterior of 5 parameters for one (colors as in Fig. 4, second row). The peculiar shape of the prior is due to the dynamic nature of these simulations. We include this application as a real-world example for the INN’s ability to recover multiple posterior modes, and strong correlations in , see details in appendix, Sec. 5.

5 Conclusion

We have shown that the full posterior of an inverse problem can be estimated with invertible networks, both theoretically and practically on problems from medicine and astrophysics. We share the excitement of the application experts to develop INNs as a generic tool, helping them to better interpret their data and models, and to improve experimental setups. As a side effect, our results confirm the findings of others that the restriction to coupling layers does not noticeably reduce the expressive power of the network.

In summary, we see the following fundamental advantages of our INN-based method compared to alternative approaches: Firstly, one can learn the forward process and obtain the (more complicated) inverse process ‘for free’, as opposed to e.g. cGANs, which focus on the inverse and learn the forward process only implicitly. Secondly, the learned posteriors are not restricted to a particular parametric form, in contrast to classical variational methods. Lastly, in comparison to ABC and related Bayesian methods, the generation of the INN posteriors is computationally very cheap. In future work, we plan to systematically analyze the properties of different invertible architectures, as well as more flexible models utilizing cycle losses, in the context of representative inverse problem. We are also interested in how our method can be scaled up to higher dimensionalities, where MMD becomes less effective.


LA received funding by the Federal Ministry of Education and Research of Germany, project ‘High Performance Deep Learning Framework’ (No 01IH17002). JK, CR and UK received financial support from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation program (grant agreement No 647769). SW and LMH received funding from the European Research Council (ERC) starting grant COMBIOSCOPY (637960). EWP, DR, and RSK acknowledge support by Collaborative Research Centre (SFB 881) ‘The Milky Way System’ (subprojects B1, B2 and B8), the Priority Program SPP 1573 ‘Physics of the Interstellar Medium’ (grant numbers KL 1358/18.1, KL 1358/19.2 and GL 668/2-1) and the European Research Council in the ERC Advanced Grant STARLIGHT (project no. 339177)


1 Proof of correctness of generated posteriors


If some bijective function transforms a probability density to , then the inverse function transforms back to .


We denote the probability density obtained through the reverse transformation as . Therefore, we have to show that . For the forward direction, via the change-of-variables formula, we have


with the Jacobian . For the reverse transformation, we have


We can substitute from Eq. 6 and obtain


In Eq. 9, the Jacobians cancel out due to the inverse function theorem, i.e. the Jacobian is the matrix inverse of .


If an INN is trained as proposed, and both the supervised loss and the unsupervised loss reach zero, sampling according to Eq. 1 with returns the true posterior for any measurement .


We denote the chosen latent distribution as , the distribution of observations as , and the joint distribution of network outputs as . As shown by Gretton et al. [2012], if the MMD loss converges to 0, the network outputs follow the prescribed distribution:


Suppose we take a posterior conditioned on a fixed , i.e. , and transform it using the forward pass of our perfectly converged INN. From this we obtain an output distribution . Because , we know that the output distribution of (marginalized over ) must be . Also, because of the independence between and in the output, the distribution of -outputs is still . So the joint distribution of outputs is


When we invert the network, and repeatedly input while sampling , this is the same as sampling from the above. Using the Lemma from above, we know that the inverted network will output samples from .


If the conditions of the theorem above are fulfilled, the unsupervised reverse loss between the marginalized outputs of the inverted network, , and the prior data distribution, , will also be 0. This justifies using the loss on the prior to speed up convergence, without altering the final results.


Due to the theorem, the estimated posteriors generated by the INN are correct, i.e. . If they are marginalized over observations from the training data, then will be equal to by definition. As shown by Gretton et al. [2012], this is equivalent to .

2 Artificial data – Gaussian mixture

In Sec. 4.1, we demonstrate that the proposed INN can approximate the true posteriors very well and is not hindered by the required coupling block architecture. Here we show how some existing methods do on the same task, using neural networks of similar size as the INN.

Figure 6: Results of several existing methods for the Gaussian mixture toy example.

Training a conditional GAN of network size comparable to the INN (counting only the generator) and only two noise dimensions turned out to be challenging. Even with additional pre-training to avoid mode collapse, the individual modes belonging to one label are reduced to nearly one-dimensional structures.

Larger cGAN

In order to match the results of the INN, we trained a more complex cGAN with 2M parameters instead of the previous 10K, and a latent dimension of 128, instead of 2. To prevent mode collapse, we introduced an additional regularization: an extra loss term forces the variance of generator outputs to match the variance of the training data prior. With these changes, the cGAN can be seen to recover the posteriors reasonably well.

Generator + MMD

Another option is to keep the cGAN generator the same size as our INN, but replace the discriminator with an MMD loss (cf. Sec. 3.4). This loss receives a concatenation of the generator output and the label it was supplied with, and compares these batch-wise with the concatenation of ground truth -pairs. Note that in contrast to this, the corresponding MMD loss of the INN only receives , and no information about . For this small toy problem, we find that the hand-crafted MMD loss dramatically improves results compared to the smaller learned discriminator.


We also compare to a conditional Variational Autoencoder of same total size as the INN. There is some similarity between the training setup of our method (Fig. 7, right) and that of cVAE (Fig. 7, left), as the forward and inverse pass of an INN can also be seen as an encoder-decoder pair. The main differences are that the cVAE learns the relationship only indirectly, since there is no explicit loss for it, and that the INN requires no reconstruction loss, since it is bijective by construction.


We adapt the cVAE to use Inverse Autoregressive Flow [Kingma et al., 2016] between the encoder and decoder. On the Gaussian mixture toy problem, the trained cVAE-IAF generates correct posteriors on par with our INN (see Fig. 6).

Dropout sampling

The method of dropout sampling with learned error terms is by construction not able to produce multi-modal outputs, and therefore fails on this task.

Invertible Neural Network

Standard (Bayesian) Neural Network

Invertible Neural Network

Conditional VAE with Inverse Autoregressive Flow


forward (simulation):

inverse (sampling):







reconstruction loss

ELBO loss
Figure 7: Abstraction of the cVAE-IAF training scheme compared to our INN from Fig. 1. For the standard cVAE, the IAF component is omitted.

2.1 Latent space analysis

To analyze how the latent space of our INN is structured for this task, we choose a fixed label and sample from a dense grid. For each , we compute through our inverse network and colorize this point in latent () space according to the distance from the closest mode in -space. We can see that our network learns to shape the latent space such that each mode receives the expected fraction of samples (Fig. 8).

Figure 8: Layout of INN latent space for one fixed label , colored by mode closest to . For each latent position , the hue encodes which mode the corresponding belongs to and the luminosity encodes how close is to this mode. Note that colors used here do not relate to those in Fig. 2, and encode the position instead of the label . The first three columns correspond to labels green, blue and red Fig. 2. White circles mark areas that contain 50% and 90% of the probability mass of latent prior .

3 Artificial data – inverse kinematics

A short video demonstrating the structure of our INN’s latent space can be found under, for a slightly different arm setup.

The dataset is constucted using gaussian priors , with and . The forward process is given by


with the arm lenghts , , .

To judge the quality of posteriors, we quantify both the re-simulation error and the calibration error over the test set, as in Sec. 4.2 of the paper. Because of the cheap simulation, we average the re-simulation error over the whole posterior, and not only the MAP estimate. In Table 2, we find that the INN has a clear advantage in both metrics, confirming the observations from Fig. 3.

Method Mean re-sim. err. Median re-sim. err. Calibration err.
cVAE 0.0368 0.0307 7.78%
cVAE-IAF 0.0368 0.0307 7.81%
INN 0.0139 0.0113 0.96%
Table 2: Quantitative evaluation of the inverse kinematics experiment
Figure 9: Posteriors generated for less challenging observations than in Fig. 3.

4 Multispectral measurements of biological tissue

The following figure shows the results when the INN trained in Sec. 4.2 is applied pixel-wise to multispectral endoscopic footage. In addition to estimating the oxygenation

, we measure the uncertainty in the form of the 68% confidence interval.

Figure 10: INN applied to real footage to predict oxygenation and uncertainty. The clips (arrows) on the connecting tissue cause lower oxygenation (blue) in the small intestine. Uncertainty is low in crucial areas and high only at some edges and specularities.

5 Star cluster spectral data

Star clusters are born from a large reservoir of gas and dust that permeates the Galaxy, the interstellar medium (ISM). The densest parts of the ISM are called molecular clouds, and star formation occurs in regions that become unstable under their own weight. The process is governed by the complex interplay of competing physical agents such as gravity, turbulence, magnetic fields, and radiation; with stellar feedback playing a decisive regulatory role [S. Klessen and C. O. Glover, 2016]. To characterize the impact of the energy and momentum input from young star clusters on the dynamical evolution of the ISM, astronomers frequently study emission lines from chemical elements such as hydrogen or oxygen. These lines are produced when gas is ionized by stellar radiation, and their relative intensities depend on the ionization potential of the chemical species, the spectrum of the ionizing radiation, the gas density as well as the 3D geometry of the cloud, and the absolute intensity of the radiation [Pellegrini et al., 2011]. Key diagnostic tools are the so-called BPT diagrams [after Baldwin et al., 1981] emission of ionized hydrogen, H+ , to normalize the recombination lines of O++ , O+ and S+ [see also Kewley et al., 2013]. We investigate the dynamical feedback of young star clusters on their parental cloud using the WARPFIELD 1D model developed by Rahner et al. [2017]. It follows the entire temporal evolution of the system until the cloud is destroyed, which could take several stellar populations to happen. At each timestep we employ radiative transfer calculations [Reissl et al., 2016] to generate synthetic emission line maps which we use to train the neural network. Similar to the medical application from Section 4.2, the mapping from simulated observations to underlying physical parameters (such as cloud and cluster mass, and total age of the system) is highly degenerate and ill-posed. As an intermediary step, we therefore train our forward model to predict the observable quantities (emission line ratios) from composite simulation outputs (such as ionizing luminosity and emission rate, cloud density, expansion velocity, and age of the youngest cluster in the system, which in the case of multiple stellar populations could be considerably smaller than the total age). Using the inverse of our trained model for a given set of observations , we can obtain a distribution over the unobservable properties of the system.

Results for one specific y are shown in Fig. 5. Note that our network recovers a decidedly multimodal distribution of that visibly deviates from the prior . Note also the strong correlations in the system. For example, the measurements investigated may correspond to a young cluster with large expansion velocity, or to an older system that expands slowly. Finding these ambiguities in and identifying degeneracies in the underlying model are pivotal aspects of astrophysical research, and a method to effectively approximate full posterior distributions has the potential to lead to a major breakthrough in this field.

6 Calibration curve for tissue parameter estimation

In Sec. 4.2, we report the median calibration error for each method. The following figure plots the calibration error, , against the level of confidence . Negative values mean that a model is overconfident, while positive values say the opposite.

Figure 11: Calibration curves for all four methods compared in Sec. 4.2.

7 Approximate Bayesian computation (ABC)

While there is a whole field of research concerned with ABC approaches and their efficiency-accuracy tradeoffs, our use of the method here is limited to the essential principle of rejection sampling. When we require samples of from the posterior conditioned on some , there are two basic ways to obtain them:


We set an acceptance threshold , repeatedly draw -samples from the prior, compute the corresponding -values (via simulation) and keep those where , until we have accepted samples. The smaller we want , the more simulations have to be run, which is why we use this approach only for the experiment in Sec. 2, where we can afford to run the forward process millions or even billions of times.


Alternatively, we choose what quantile

of samples shall be accepted, and then run exactly simulations. All sampled pairs are sorted by and the closest to form the posterior. This allows for a more predictable runtime when the simulations are costly, as in the medical application in Sec. 4.2 where .

8 Details of datasets and network architectures

Table 3 summarizes the datasets used throughout the paper. The architecture details are given in the following.

Experiment training data see also
Gaussian mixture
Inverse kinematics
Medical data Wirkert et al. [2016]
Astronomy Pellegrini et al. [2011]
Table 3: Dimensionalities and training set sizes for each experiment.

8.1 Artificial data – Gaussian mixture


3 invertible blocks, 3 fully connected layers per affine coefficient function with ReLU activation functions in the intermediate layers, zero padding to a nominal dimension of 16, Adam optimizer, decaying learning rate from

to , batch size 200. The inverse multiquadratic kernel was used for MMD, with in both - and -space.

Dropout sampling:

6 fully connected layers with ReLU activations, Adam optimizer, learning rate decay from to , batch size 200, dropout probability .


6 fully connected layers for the generator and 8 for the discriminator, all with leaky ReLU activations. Adam was used for the generator, SGD for the discriminator, learning rates decaying from to , batch size 256. Initially 100 iterations training with , to separate the differently labeled modes, followed by pure GAN training.

Larger cGAN:

2 fully connected layers with 1024 neurons each for discriminator and generator, batch size 512, Adam optimizer with learning rate

for the generator, SGD with learning rate and momentum 0.05 for the discriminator, weight decay for both, 0.25 dropout probabiliy for the generator at training and test time. Equal weighting of discriminator loss and penalty of output variance

Generator with MMD:

8 fully connected layers with leaky ReLU activations, Adam optimizer, decaying learning rate from to , batch size 256. Inverse multiquadratic kernel, .


3 fully connected layers each for encoder and decoder, ReLU activations, learning rate

, decay to , Adam optimizer, batch size 25, reconstruction loss weighted 50:1 versus KL divergence loss.

8.2 Artificial data – inverse kinematics


6 affine coupling blocks with 3 fully connected layers each and leaky ReLU activations. Adam optimizer, decaying learning rate from to , multiquadratic kernel with .


4 fully connected layers each for encoder and decoder, ReLU activations, learning rate , decay to , Adam optimizer, batch size 250, reconstruction loss weighted 15:1 versus KL divergence loss.

8.3 Functional parameter estimation from multispectral tissue images


3 invertible blocks, 4 fully connected layers per affine coefficient function with leaky ReLUs in the intermediate layers, zero padding to double the original width. Adam optimizer, learning rate decay from to , batch size 200. Inverse multiquadratic kernel with , weighted MMD terms by observation distance with decaying to 0.

Dropout sampling/point estimate:

8 fully connected layers, ReLU activations, Adam with decaying learning rate from to , batch size 100, dropout probability .


4 fully connected layers each for encoder and decoder, ReLU activations, learning rate , decay to , Adam optimizer, batch size 25, reconstruction loss weighted :1 versus KL divergence loss.

8.4 Impact of star clusters on the dynamical evolution of the galactic gas


5 invertible blocks, 4 fully connected layers per affine coefficient function with leaky ReLUs in the intermediate layers, no additional zero padding. Adam optimizer with decaying learning rate from to , batch size 500. Kernel for latent space: with . Kernel for -space: . Due to the complex nature of the prior distributions, this was the kernel found to capture the details correctly, whereas the peak of the inverse multiquadratic kernel was too broad for this purpose.