Bayesian Uncertainty Quantification by Deep Generative Model
We develop a generative model-based approach to Bayesian inverse problems, such as image reconstruction from noisy and incomplete images. Our framework addresses two common challenges of Bayesian reconstructions: 1) It makes use of complex, data-driven priors that comprise all available information about the uncorrupted data distribution. 2) It enables computationally tractable uncertainty quantification in the form of posterior analysis in latent and data space. The method is very efficient in that the generative model only has to be trained once on an uncorrupted data set, after that, the procedure can be used for arbitrary corruption types.READ FULL TEXT VIEW PDF
Employing deep neural networks as natural image priors to solve inverse
We consider uncertainty aware compressive sensing when the prior distrib...
Uncertainty quantification has been a core of the statistical machine
Nowadays, hospitals are ubiquitous and integral to modern society. Patie...
Reconstruction of signals from compressively sensed measurements is an
We present a Bayesian machine learning architecture that combines a
The first step to realize automatic experimental data analysis for fusio...
Bayesian Uncertainty Quantification by Deep Generative Model
Data reconstruction from corrupted data lies at the heart of Bayesian inverse problems. There are many ways in which data can be corrupted, ranging from missing data (or masking) to noise and blurring. In many cases the reconstructed data, e.g. an image, is high dimensional. Data reconstruction commonly faces two challenges: 1) For most data it is difficult to find a prior that optimally represents our knowledge of the uncorrupted data. Instead, priors are often chosen to impose certain regularity conditions (e.g. maximum entropy, smoothness). 2) While it is usually tractable to find the maximum of the posterior, i.e. the most probable underlying realization, a full uncertainty quantification of the reconstruction is usually prohibitively expensive.
We propose to address these challenges with generative models, which provide a mapping from points in a typically lower dimensional latent space to points in the high dimensional data space. As an example of such a model we will be using a Variational AutoEncoder (VAE)(Kingma & Welling, 2013; Rezende et al., 2014). However, the use of a VAE is not a limitation of this method and other generative models could be used instead. VAEs are designed to model the distribution of high-dimensional input data, , by introducing a mapping to a lower dimensional latent representation, . The latent space variables are enforced to follow a given prior distribution,
, which is typically chosen to be a standard normal distribution.111Other generative models might not fulfill this condition of a Gaussian latent space distribution. For our method to work we only require the distribution of latent space variables to be fairly well behaved such that it can be mapped to a Gaussian, e.g. with a bijective normalizing flow, as we explain later.
Given a generative model, the posterior of the latent variables for a given data realization can be modeled with Bayes rule
This formulation allows to address both aforementioned problems: 1) The prior distribution,
, reflects the distribution of the training data. 2) The representation of the posterior in the lower dimensional latent space enables tractable posterior analysis. In particular it allows to examine and fit the posterior distribution and draw samples from it. The samples can then be visualized in data space by forward modeling with the generative model. This approach to uncertainty quantification is in the spirit of commonly applied sampling approaches like Hamiltonian Monte Carlo (HMC) sampling. In fact, it is the only way to reliably quantify uncertainty in high dimensional spaces, where often already the covariance is too big to fit into memory. Quantifying uncertainty by means of the variance instead of the covariance can be misleading in the presence of strong correlations between pixels, which are typically expected for image data. In addition, neither variance nor covariance can be used to fully characterize non-Gaussian distributions such as the multimodal distributions, which are common for corrupted data. All of these properties (complex correlations, multimodality) are captured in samples drawn from a sufficiently accurate fit to the posterior distribution in latent space.
In addition to addressing the problems of accurate priors and tractable posterior analysis, the suggested approach is very efficient, because the generative model only has to be trained once on uncorrupted data and can then be used for arbitrary types of data corruption.
In this work we consider linear inverse problems of the form , where is the signal to be recovered, are the observations, represents Gaussian noise, and is a corruption operator (i.e. a masking operator for an inpainting problem, or a convolution by a blurring kernel for a deconvolution problem).
In our Bayesian approach, we assume that the uncorrupted signal, , is drawn from a latent variable model, , with normal Gaussian prior . While different likelihood models could be used to describe , we adopt a Gaussian model , where
is parameterized by a deep neural network. The variance
measures the modeling error, i.e. the average discrepancy between an input image and its reconstruction after encoding and decoding with the generative model. To get an estimate of the size of this modeling error we suggest to measure its value after having trained the generative model or to leave it as a free, trainable parameter during the training. The Gaussian choice for the likelihood of the generative model allows us to easily combine it with the probability distribution of the noise in the corrupted data. We train the generative model on uncorrupted data,. Here, we use a vanilla VAE with mean-field Gaussian posterior and train it under the Evidence Lower BOund (ELBO) (Kingma & Welling, 2013).
Once the generative model is trained, we can use it to expand the inverse problem formulation , where we have replaced the signal, , by its generative process and added which accounts for the generator’s modeling error. Contrary to the previous formulation, we have now stated the problem in terms of a variable, , for which we have a simple and adequate prior (i.e. a normal Gaussian), allowing us to apply the Bayesian approach , where is the posterior distribution of the latent variables for a given observation . Note that this formulation does not use the encoder of the generative model nor does it depend on the approximate posterior used in the VAE training.
To provide a concrete example, let us consider a problem combining missing data and Gaussian noise, for which A is a binary masking matrix with entries equal to 1 if data is observed, 0 otherwise. The expression for the log posterior becomes,
where . Note that this expression would change for different corruption types, while the generative model would remain the same, i.e. it does not need to be retrained.
Despite the dimensionality reduction, the analysis of this posterior is still challenging: 1) It is often multi-modal, because multiple solutions agree with the data. 2) Even in low dimensions, standard methods such as MCMC can be prohibitively slow and may give incorrect results in multi-modal situations (Wu et al., 2018). Here, we adopt the ELO method (Seljak & Yu, 2019) to address these challenges. ELO replaces the true posterior, by a parameterized approximation, , and finds the optimal parameters, , by minimizing the L f-divergence between the two distributions. This approach needs few sampling points, does not suffer from sampling noise, and can be orders of magnitude faster than stochastic variational inference (SVI) (Seljak & Yu, 2019).
In our examples, we use a mixture of multivariate Gaussians (GMM) as posterior approximation. Under the assumption of well separated mixture components, the ELO procedure for a GMM involves the following 3 steps: 1) finding all relevant minima in the posterior by optimization, 2) fitting Gaussians around these minima by using the Laplace approximation, 3) improving the probability distribution beyond the Gaussian approximation, if needed, 4) finding the relative weights of these mixture components. Details on the ELO procedure including equations are provided in Appendix A. While in principle SVI could be used instead of the ELO procedure, we find in our examples that ELO gives more accurate results and requires less tuning, while SVI is prone to numerical instabilities when fitting a full rank Gaussian (see Appendix A.3 for a direct comparison on an example).
Left panel: Data imputation, denoising and deblurring examples. From top to bottom row we show corrupted data, reconstructions obtained from forward modeling the latent space variables at the deepest minimum of the negative log posterior (the MAP solutions) and the underlying true images. Middle panel: Forward modeled samples of the example in the last column of the left panel (drawn from the two component Gaussian mixture posterior shown in Figure3). We get a mixture of 4’s and 9’s corresponding to the two posterior peaks. Right panel: samples overplotted with the mask demonstrating that they are compatible with the input data within the errors of the generative model.
To demonstrate the methodology, we consider several examples of data corruption on the MNIST dataset (Lecun et al., 1998). We begin by training a VAE on the uncorrupted training set following a setting similar to (Gritsenko et al., 2019) with both encoder, , and decoder, , parameterized as a sequence of 4 fully-connected ResNet blocks, each block containing 2 fully-connected layers with size 512 and LeakyRelu activation. The encoder reduces the dimensionality to 10. We minimize the loss using ADAM (Kingma & Ba, 2014) with default parameters, a batch size of 1024, and a decreasing learning rate starting at 0.001.
Training a VAE with the ELBO as objective function is not guaranteed to result in an encoded distribution that perfectly matches the prior. This manifests itself in poor sample qualities when forward modeling samples drawn from the prior distribution. A number of approaches have been suggested to improve on this; most of them modify the objective function (Makhzani et al., 2015; Mescheder et al., 2017; Tolstikhin et al., 2017; Ulyanov et al., 2017; Beitler et al., 2018). Here we choose a simple, different approach: to ensure that the prior is well described by a unit variance Gaussian, we augment the forward model by a normalizing flow (specifically, we use a RealNVP (Dinh et al., 2016)). The normalizing flow is trained to map the latent space distribution of the VAE to a standard normal distribution. The full forward model, i.e. the mapping from latent space to data space then simply becomes a successive application of the generator of the normalizing flow followed by the generator of the VAE. The entire procedure is also illustrated in Appendix B. In Figure 2 we show how the samples from the generative model improve after this augmentation. Adding an additional flow model might not be required for all data sets and generative models. If it is required can be judged from the sample quality. In our example, the latent space distribution is already close to a Gaussian. In this case, a RealNVP with eight affine coupling layers, each with hidden layers consisting of [512,512] units, is sufficient to achieve almost perfect samples.
We then use our reconstruction technique on corrupted images produced from the MNIST test data, shown in the top row of the left panel in Figure 1. In the first we randomly remove 95% of the pixels (Dalca et al., 2019), in the second we mask half of the image and add noise with to the other half, in the third we blur the image by convolution with a Gaussian kernel. In the last example we apply a broad mask, such that different digits (4s and 9s) are compatible with the data.
To reconstruct the images, we perform the posterior analysis outlined in the previous section. We first use optimization to find all local minima of the negative log posterior: we start from different random positions drawn from the prior and use the ADAM optimizer to descend to the local minima. We stop our search once we find that the minimization procedure repeatedly converges to the same minima, typically after ~10 minimizations. We plot reconstructions from the lowest minimum in the middle row of the left panel of Figure 1 and the true input image for comparison below them. An example of a latent space posterior is shown in Figure 3: there are two posterior peaks that dominate the posterior mass, with the first one containing 25% and the second one containing 75% of the posterior mass.
We represent uncertainty quantification in data space by drawing samples from the latent space posterior, and evaluating them in data space using the generator. In the example of the masked 4 there is a range of possible solutions allowed given the broad and bi-modal posterior (Figure 3): some of the drawn digits are closer to a 4 while others are closer to a 9. We plot unmasked and masked samples in the two right panels of Figure 1 to show that the unmasked data is reproduced within the errors. We provide code to reproduce our experiments in a public github repository .
Related work:Deep learning approaches to inverse problems (e.g. Dong et al., 2015; Jin et al., 2016; Putzky & Welling, 2017; Ulyanov et al., 2017), can return high quality point estimates, but usually do not provide uncertainty estimates, which are crucial to many applications (e.g, medical imaging or scientific applications). Data imputation with VAEs was initially suggested by Rezende et al. (2014) and further refined by Mattei & Frellsen (2018a, b). These methods rely on expensive sampling to marginalize over and evaluate . An approach similar to this work was recently developed by Wu et al. (2018), but uses SVI for fitting the posterior model, which can be orders of magnitude slower than the ELO procedure used in this work. An alternative approach to uncertainty quantification based on learning to sample from the posterior with a W-GAN was proposed by Adler & Öktem (2018). Their approach requires retraining for different corruption types, and should be affected by the well known limitations of GANs (training instability and mode collapse).
Uncertainty quantification of complex and strongly corrupted data can be challenging, and Bayesian posterior analysis is an approach that can quantify complex posterior distributions such as multimodal solutions. It requires a reliable prior distribution in latent space, which can be obtained using generative models such as a VAE. Posterior analysis with MCMC is often too slow and can be trapped in a single mode. We propose a fast alternative based on the L f-divergence that gives high quality posteriors in latent space and in data space.
On the relationship between normalising flows and variational- and denoising autoencoders, 2019.URL https://openreview.net/forum?id=HklKEUUY_E.
Journal of Machine Learning Research, 18:14:1–14:45, 2017. URL http://jmlr.org/papers/v18/16-107.html.
Stochastic backpropagation and approximate inference in deep generative models.In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278–1286, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.html.
ELO (Seljak & Yu, 2019) is an efficient and numerically stable procedure for fitting a parameterized distribution to another distribution , when we can evaluate for a given , but do not have access to the analytical form of the distribution. In its most general form ELO determines the parameters, , of by minimizing the following objective function,
where the expectation value is taken by averaging over samples from an arbitrary distribution (ideally is chosen to be close to the real distribution ), is the number of terms included, and the factor can be used to give different weight to different derivative orders. The sum over indices should be symmetrized to avoid double counting.
Setting , Eq. 3 becomes
with . Note that we have added a normalization, , in the last term to account for the fact that might not be properly normalized. The last term is only required if the correct normalization for has to be determined, e.g. if is a mixture of Gaussians and their relative weights have to be found.
Specializing to the Gaussian case, , we have . For random sampling points, , the is minimized by setting
This results in a first estimate of , which can be used to draw new samples, , for which the fitting procedure can be repeated, etc. and variational inference by minimization of the KL-divergence, , converge to the same result in the high sample limit, if . However, has the advantage of lower sampling noise: the presence of in the last equation guarantees that there is no sampling noise. This term is missing in stochastic KL minimization (Opper & Archambeau, 2009).
We can further accelerate the convergence by choosing the initial sampling points wisely, e.g. for a unimodal distribution, one can first find the maximum, , of through gradient descent. If one chooses , simplifies to the well known Laplace approximation.
Since we expect the posterior of corrupted data to be generally multimodal, we use to fit a Gaussian mixture model (GMM), . Assuming that the distribution has well separated maxima, , finding the parameters of the GMM proceeds similarly as for the single Gaussian case. In fact, we can simply fit Gaussians as outlined above around the maxima. In addition, we need to determine their relative weights, . The additional constraint comes from the last term in Eq. 4 (which was not required to find the parameters of the single Gaussian). We see that gets minimized if we set
and require .
We use the fitting procedure outlined in the previous section for the examples shown in the main text. First, we determine all the relevant minima by running several gradient descents with decreasing learning rate from random starting points that we draw from the prior distribution. Each minimization only takes of the order of 10 seconds and we find that we rarely need to start from more than 10 different starting points to find all the relevant minima. Once the gradient descent has converged, we check the Hessian, , at the endpoint for positive definiteness. This step ensures that we do not mistake saddle points as minima. We have found that the minimization is less likely to get stuck at points other than actual minima if we apply an annealing scheme in which we initially downweight the likelihood compared to prior. An example of the outcomes of twenty minmizations from different starting points for the example of the masked 4 (last column in Figure 1), which has the most complex posterior of all our examples, is shown in Figure 4.
We can clearly see that the minimization reliably finds one of the two deepest minima. Note that following the relative importance of minima (their relative weights) does depend their posterior mass, which is determined by both their depth and width. A deep, but very local minimum might have less posterior mass than a shallower, wide minimum. Our minimization procedure is unlikely to miss wide minima and in all examples, we find that the forward modeled lowest minimum is always very close to the input, suggesting that it is the global minimum.
After minimization and discarding all points with non-positive definite Hessian, we identify separate minima, , by comparing their distance in latent space (per latent space dimension ) to their width (the width is estimated from the variance, , at the respective minima). Having identified all separate minima, we fit Gaussian around them and determine their relative weights with Eq. 7.
Comparison of different posterior fitting procedures on the example of the noisy, half masked four. In the two images on the left, we show the true underlying image and the input data, which has been corrupted by adding white noise and a mask covering half of the image. In the right three images we show the maxima of the fitted Gaussians forward modeled to data space, starting with theestimate (MAP solution), followed by the mean field VI estimate and the full rank VI estimate on the right. The VI estimates are slightly different to the estimate, reflecting the fact that their maxima do not coincide with the global maximum (c.p. Fig. 6).
We compare different posterior fitting procedures for the second example in Figure 1, for which we found the posterior to be to good approximation unimodal. In this case, reduces to finding the minimum and fitting a full rank Gaussian by means of the Laplace approximation. For comparison, we also fit Gaussians with diagonal (mean field approximation) and full rank covariance to the posterior by stochastically minimizing the evidence lower bound (ELBO) (Kucukelbir et al., 2017). For we run 10 minimization starting from random starting points, each taking about 10 seconds. Out of these ten minimizations, 4 converge to the deepest minimum, and we find that the contribution of the other local minima to the posterior is suppressed. We compute the curvature at the global minimum by automated differentiation and use its inverse as an estimate of the covariance of our Gaussian fit. For the stochastic VI estimate we minimize the ELBO using the ADAM optimizer with decreasing learning rate. We draw 512 samples from the approximate posterior to estimate the gradient in each step (we see a clear trend of improved results with increased number of samples). We find that we need of the order of 8000 steps for the ELBO to converge, taking between 130 (mean field) and 170 (full rank) seconds, making it in both cases slower than running 10 minimizations for the estimate. Despite using the Cholesky decomposition and ensuring that the diagonal is positive with a softplus transform, we find that we have to add a small constant offset to the diagonal of the estimated covariance to make the minimization of the ELBO numerically stable. We plot the outcomes of these fits in Figure 6. The red line shows the true negative log posterior probed along one specific latent space dimension (keeping the other dimensions fixed at the position of the global minimum), the other lines show the respective estimates from the fitted posteriors, , restricted to the same dimension. We use the value of the true negative log posterior evaluated at the mean of the fitted posterior as anchor point for the fits (which is why the minima of the SVI estimates do not need to lie on the red line). The minima of the negative log posteriors fitted with the ELBO objective do not coincide with the true mininum. The reconstructions from these points have less similarity with the truth than the MAP estimate (Fig. 5), the fact that they still look similar to the MAP solution is probably owed to the relative simplicity of the posterior shape, but could become more problematic for other posterior shapes.
As described in the main text, we start by training the encoder and decoder of a VAE under the evidence lower bound (Figure 7), assuming a Gaussian likelihood. The modeling error of this VAE can be estimated from the root mean squared (rms) difference between the input and the reconstruction or by allowing the variance of the likelihood to be a trainable variable and using its value at the end of training as an estimate.
To ensure that the latent space distribution is well described by the prior, we add an additional component to the forward model: We first encode the data with the VAE encoder. We then fit a bijective normalizing flow, , to map this distribution to a normal distribution (Figure 8).
After having trained both the VAE and the normalizing flow, we build our complete forward model ( Figure 9): The hidden variable is passed through the bijective flow and the decoder of the VAE to produce a realization of the uncorrupted data . This data is then corrupted by the operator and a realization of the measurement noise is added. This produces a realization of noisy corrupted data.