Bayesian inference is a well-established technique for quantifying uncertainties in inverse problems that are constrained by physical principles kaipio2006statistical; dashti2016bayesian; Polpo2018. It has found applications in diverse fields such as geophysics Gouveia1997; Malinverno2002; Martin2012; Isaac2015, climate modeling Jackson2004, chemical kinetics Najm2009, heat conduction Wang_2004, astrophysics Loredo1990; AsensioRamos2007, materials modeling Sabin2000 and the detection and diagnosis of disease Siltanen2003; Kolehmainen2006. The two critical ingredients of a Bayesian inference problem are - an informative prior representing the prior belief about the parameters to be inferred and an efficient method for sampling from the posterior distribution. In this manuscript we describe how certain deep generative techniques can be effectively used in these roles. In this section, we provide a brief introduction to Bayesian inference as it is applied to solving inverse problems and to generative adversarial networks (GANs), which are a popular class of deep generative algorithms. Then in the following sections, we described how GANs may be used within a Bayesian context.
1.1 Bayesian inference
We consider the setting where we wish to infer a vector of parameters from the measurement of a related vector , where the two are related through a forward model . A noisy measurement of is denoted by ,
where the vector represents noise. While the forward map is typically well-posed, its inverse is not, and hence to infer from the measurement requires techniques that account for this ill-posedness. Classical techniques based on regularization tackle this ill-posedness by using additional information about the sought solution field explicitly or implicitly tarantola2005inverse
. Bayesian inference offers a different approach to this problem by modeling the unknown solution as well as the measurements as random variables. This statistical framework addresses the ill-posedness of the inverse problem, and allows for the characterization of the uncertainty in the inferred solution.
The notion of a prior distribution plays a key role in Bayesian inference. It is assumed that through multiple observations of the field , denoted by the set , we have prior knowledge of that can be utilized when inferring from . This is used to build, or intuit, a prior distribution for , denoted by
. Some typical examples of priors include Gaussian process prior with specified co-variance kernels, Gaussian Markov random fieldsFahrmeir2001, Gaussian priors defined through differential operators Stuart2010, and hierarchical Gaussian priors Marzouk2009; Calvetti2008. These priors promote some smoothness or structure in the inferred solution and can be expressed explicitly in an analytical form.
Another key component of Bayesian inference is a distribution that represents the likelihood of given an instance of , denoted by . This is often determined by the distribution of the error in the model, denoted by , which captures both model and measurement errors. Given this, and an additive model for noise (1), the posterior distribution of
, determined using Bayes’ theorem after accounting for the observationis given by,
is the prior-predictive distribution of and ensures that the posterior is a true distribution and integrates to one.
The posterior distribution above completely characterizes the uncertainty in ; however for vectors of large dimension (that is, large
) characterizing this distribution explicitly is a challenging task. Consequently the expression above is used to perform tasks that are more manageable. These include determining estimates such as the maximum a-posteriori estimate (MAP) which represents the value ofthat maximizes the posterior distribution, expanding the posterior distribution in terms of other distributions that are simpler to work with Bui-Thanh2012
, or using techniques like Markov Chain Monte-Carlo (MCMC) to generate samples that are “close” to the samples generated by the true posterior distributionhan2001markov; parno2018transport.
Despite its numerous applications and successes in solving inverse problems, Bayesian inference faces significant challenges. These include
defining a reliable and informative prior distribution for when the set is difficult to characterize mathematically.
efficiently sampling from the posterior distribution when the dimension of () is large,a typical situation in many practical science and engineering applications.
1.2 Generative adversarial networks
Generative adversarial networks, or GANs, are a class of generative deep neural networks based on a two-player min-max game that have found many applications since their advent goodfellow2014generative. As shown in Figure 1, they comprise of a generator that maps a latent vector to , where typically,
. The components of the latent vector are selected from a simple distribution, typically a Gaussian or a uniform distribution. The generator up-scales these components through successive application of non-linear transformation at each layer, whose parameters are learned by the algorithm during training.
The other component of a GAN is a discriminator, which is also composed of successive non-linear transformations. However, these transformations are designed to down-scale the original input. The final few layers of the discriminator are fully connected neural networks which lead to a simple classifier (like a soft-max, for example). The discriminator maps an input field,
, to a binary variable, which indicates whether the input is “real” or “fake”. The discriminator’s weights are also learned during training.
The generator and the discriminator are trained in an adversarial manner. The training data for the discriminator is comprised of the set of real instances of , that is
, and a set of “fake” instances generated by the generator, along with the corresponding label: fake or real. The loss function is driven by the accuracy with which the discriminator correctly labels each image. The generator is trained by passing its output through the discriminator and requiring it to be labeled as “real.” Thus while the generator is trained to “fool” the discriminator, the discriminator is trained so as not to be fooled by the generator.
By carefully selecting the loss function the adversarial training process described above can be interpreted as ensuring similarity (in an appropriate measure) between the true distribution of , denoted by , and the distribution of the samples generated by the generator, denoted by . A particular family of GANs, called the Wasserstein GAN, which minimizes the Wasserstein metric between and , has emerged as one of the most popular type of GAN due to its better stability properties Arjovsky2017; gulrajani2017improved.
In several applications, GANs have demonstrated a remarkable ability to approximate the underlying true distribution goodfellow2014generative; Makhzani2015; Dumoulin2016; Mescheder2017; Brock2018; Karras2017; fedus2018maskgan; tulyakov2018mocogan; Ma2017; Wang2017. Further, since samples from the approximate distribution are generated by sampling from the much simpler distribution of the latent vector (whose dimension is much smaller than that of ), they have been applied to generate numerous samples of consistent with .
1.3 Related work
The main idea developed in this paper involves training a GAN using the sample set , and then using the distribution learned by the GAN as the prior distribution in Bayesian inference. This leads to a useful method for representing complex prior distributions and an efficient approach for sampling from the posterior distribution by re-writing it in terms of the latent vector .
The solution of an inverse problem using sample-based priors has a rich history (see Vauhkonen1997; Calvetti2005 for example). As does the idea of reducing the dimension of the parameter space by mapping it to a lower-dimensional space Marzouk2009; Lieberman2010. However, the use of GANs in these tasks is novel.
Recently, a number of authors have considered the use machine learning-based methods for solving inverse problems. These include the use of convolutional neural networks (CNNs) to solve physics-driven inverse problemsAdler2017; Jin2017; patel2019circumventing
, and GANs to solve problems in computer visionChang; Kupyn2018; Yang2018; Lediga; Anirudh2018; pix2pix2016; CycleGAN2017; Kim2017. There is also a growing body of work dedicated to using GANs to learn regularizers in solving inverse problems lunz2018adversarial and in compressed sensing bora2017compressed; bora2018ambientgan; kabkab2018task; wu2019deep; shah2018solving. However, these approaches differs from ours in at least two significant ways. First, they solve the inverse problem as an optimization problem and do not rely on Bayesian inference; as a result, regularization is added in an ad-hoc manner, and no attempt is made to quantify the uncertainty in the inferred field. Second, the forward map is assumed to satisfy an extension of the restricted isometry property, which may not be the case for forward maps induced by physics-based operators.
More recently, the approach described in adler2018deep utilizes GANs in a Bayesian setting; however the GAN is trained to approximate the posterior distribution (and not the prior, as in our case), and training is done in a supervised fashion. That is, paired samples of the measurement and the corresponding true solution are required. In contrast, our approach is unsupervised, where we require only samples of the true solution to train the GAN prior.
The layout of the remainder of this paper is as follows. In Section 2, we develop a formulation for Bayesian inference when the prior distribution is defined by a GAN and describe techniques for sampling from this distribution. Thereafter in Section 3, we utilize these techniques to solve an inverse problem and quantify uncertainty in our solution. We end with conclusions in Section 4.
2 Problem Formulation
The central idea of this paper is to train a GAN using the sample set and then use the distribution defined by the GAN as the prior distribution in Bayesian inference. As described in this section, this leads to a useful method for representing complex prior distributions and an efficient approach for sampling from the posterior.
Let denote the set of instances of vector sampled from the true distribution, . Further, let characterize the latent vector space and be the generator of a GAN trained using . Then according to goodfellow2014generative, with infinite capacity and sufficient data, the generator learns the true distribution. That is,
The distribution is defined as
Here is the multivariate distribution of the latent vector whose components are iid and typically conform to a Gaussian or a uniform distribution. The equation above implies that the GAN creates synthetic samples of by first sampling from and then passing these samples through the generator.
Now consider a measurement from which we would like to infer the posterior distribution of . For this we use (2) and set the prior distribution to be equal to the true distribution, that is . Then, under the conditions of infinite capacity of the GAN, and sufficient data, from (3), this is the same as setting in this formula. Therefore,
Now for any , we have
where is the expectation operator, and
Note that the distribution is the analog of in the latent vector space. The measurement updates the prior distribution for to the posterior distribution. Similarly, it updates the prior distribution for , , to the posterior distribution, , defined above.
Equation (6) implies that sampling from the posterior distribution for is equivalent to sampling from the posterior distribution for and transforming the sample through the generator . That is,
Since the dimension of is typically much smaller than that of , and since the operation of the generator is typically inexpensive to compute, this represents an efficient approach to sampling from the posterior of .
Note that the left hand side of (6) is the expression for a population parameter of the posterior, defined by . The right hand sides of the last two lines of this equation describe how this parameter may be evaluated by sampling (instead of ) from either or . In the following section we describe sums that approximate these integrals.
2.1 Sampling from the posterior distribution
We consider the following scenario:
We wish to infer and characterize the uncertainty in the vector of parameters from a noisy measurement of denoted by in (1), where is a known map that connects and .
We have several prior measurements of plausible , contained in the set .
For this problem we propose the following algorithm that accounts for the prior information in and the “new” measurement through a Bayesian update:
Train a GAN with a generator on .
Sample from given in (8).
With sufficient capacity in the GAN and with sufficient training, the posterior obtained using this algorithm will converge to the true posterior. Further, since GANs can be used to represent complex distributions efficiently, this algorithm provides a means of including complex priors that are solely defined by samples within a Bayesian update.
As mentioned earlier, an efficient approach to sampling from is to recognize that the dimension of is typically much smaller ( - ) than that of ( - ). We now describe two approaches for estimating population parameters of the posterior that make use of this observation.
Monte-Carlo (MC) approximation
The first approach is based on a Monte-Carlo approximation of a population parameter of the posterior distribution. This integral, which is defined in the second line of (6), may be approximated as,
In the equation above, the numerator is obtained from a MC approximation of the integral in (6), and the denominator is obtained from a MC approximation of the scaling parameter . We note that the sampling in this approach is rather simple since in a typical GAN the s are sampled from simple distributions like a Gaussian or a uniform distribution.
Markov-Chain Monte-Carlo (MCMC) approximation
In many applications we anticipate that the likelihood will tend to concentrate the distribution of latent vector to a small region within . Thus the MC sampling described above may be inefficient since it will include regions where the likelihood will take on very small values. A more efficient approach will be to generate an MCMC approximation using the definition in (7), and thereafter sample from this distribution. Then the corresponding sample for is given by , and from the third line of (6), any desired population parameter may be approximated as
2.2 Expression for the maximum a-posteriori estimate
The techniques described in the previous section focused on sampling from the posterior distribution and computing approximations to population parameters. These techniques are general in that they can be applied in conjunction with any distribution used to model noise and the latent space vector; that is, any choice of (likelihood) and (prior). In this section we consider the special case when Gaussian models are used for noise and the latent vector. In this case, we can derive a simple optimization algorithm to determine the maximum a-posteriori estimate (MAP) for as described below. This point is denoted by in the latent vector space and represents the most likely value of the latent vector in the posterior distribution. It is likely that the operation of the generator on , that is , will yield a value that is close to , and may be considered as a likely solution to the inference problem.
We consider the case when the components of the latent vector are iid with a normal distribution with zero mean and unit variance. This is often the case in many typical applications of GANs. Further, we assume that the components of noise vector are defined by a normal distribution with zero mean and a covariance matrix. Using these assumptions in (7), we have
The MAP estimate for this distribution is obtained by maximizing the argument of the exponential. That is
This minimization problem may be solved using any gradient-based optimization algorithm. The input to this algorithm is the gradient of the functional with respect to , which is given by
where the matrix is defined as
Here is the derivative of the forward map with respect to its input , and is the derivative of the generator output with respect to the latent vector. In evaluating the gradient above we need to evaluate the operation of the matrices and on a vector, and not the matrices themselves. The operation of on a vector can be determined using a back-propagation algorithm with the GAN; while the operation of can be determined by making use of the adjoint of the linearization of the forward operator. In the case of a linear inverse problem, this matrix is equal to the forward map itself.
Once is determined, one may evaluate by using the GAN generator. This represents the value of the field we wish to infer at the most likely value value of latent vector. Note that this is not the same as the MAP estimate of .
It is interesting to note that in a typical Bayesian inverse problem (that does not use GANs as priors) under additive Gaussian noise and Gaussian prior with as covariance, the posterior distribution is given by,
Seeking leads to an optimization problem that is similar to the one for (13). However, there are two crucial differences. First, it is harder problem to solve since the optimization variable is , whose dimension is greater than that of . Second, while different choices of lead to different types of regularizations for the MAP (like or
), all of these tend to smooth the solution and none allow for the preservation of sharp edges, which is critical in medical imaging and other applications. Total variation (TV) type regularization strategies do allow for these types of solutions; however they do not translate to conjugate priors with Gaussian likelihood when viewed from a Bayesian perspective. In contrast to this, when using a GAN as a prior we can allow for sharp variations in, while still enjoying the benefit of conjugate priors for determining .
We have described three algorithms for probing the posterior distribution when the prior is defined by a GAN. These include an MC (9) and an MCMC estimate (10) of a given population parameter and a MAP estimate that is applicable to additive Gaussian noise with a Gaussian prior for the latent vector (12). In the following section we apply these algorithms to a canonical inverse problem.
3 Numerical results
In this section we describe a numerical example where we utilize a GAN (Wasserstein GAN, in particular) as a prior in a Bayesian inference problem. We first train a GAN on a set of realizations of the field to be inferred. Thereafter, using this GAN as a prior, and a single instance of a noisy measurement, we infer the desired field using a Bayesian update.
Since our goal is to validate the approach developed in this paper, we consider examples where we know the “correct solution.” In particular, we generate, rather than measure, the input data as follows:
We select a stochastic parametric representation for the inferred field and generate multiple samples from it. This gives us the set , which is used to train the GAN prior.
We sample once more from this parametric representation to generate the “target” instance of the inferred field. This is denoted by .
This field is transformed by the forward operator to generate a noise-free version of the measured field, . This measurement is corrupted with additive noise drawn from a known distribution to generate the measurement, .
Once this input data is generated, following the approach described in the previous section we:
Use a MC approximation to evaluate the MAP (), the mean (
) and the standard deviation (that is the square root of principal diagonal of the auto-covariance of) of each component of for the posterior distribution.
Generate a Markov chain to sample from the posterior an evaluate the statistics listed above.
The estimates and may be considered our best guess at the correct value of the inferred target field . Thus the distance between these fields represents the “error” introduced in the Bayesian inference. We note that there are three sources of this error: the loss of information inherent in the forward map , the noise in the measurement, and the approximations inherent in our algorithm. The approximation errors include the difference between the true prior distribution and the distribution learned by the GAN, and the errors induced by the MC or MCMC sampling.
3.1 Inferring the initial state in a heat conduction problem
We apply the Bayesian inference approach with a GAN prior to the problem of determining the initial temperature distribution of a Fourier solid from a measurement of its current temperature. In this case the field to be inferred () is the initial temperature, which is represented on a grid on a square of edge length units. The forward operator is defined by the solution of the time-dependent heat conduction problem with uniform conductivity, . This operator maps the initial temperature (the quantity to be inferred) to the temperature at time (the measured quantity, ). The discrete version of this operator is generated by approximating the time-dependent linear heat conduction equation using central differences in space and backward difference in time. it is given by,
In the equation above, is the second-order finite difference approximation of the Laplacian, is the time step, and is the number of time steps. In Figure 2
, we have plotted the eigenvalues of this operator as a function of mode number on a log scale. We notice that they decay exponentially with increasing mode number, indicating a significant loss of information in the higher modes. The modes would be used to represent sharp edges and corners in the inferred solution.
It is assumed that the initial temperature is zero everywhere except in a rectangular region where it is set to a non-zero value. The initial temperature field is parameterized by the horizontal and vertical coordinates of the top-left and bottom-right corners of the rectangular region and the value of the temperature field within the rectangular region. Each parameter is chosen from a uniform distribution. The top-left coordinates are constrained to be in the range , while the bottom-right coordinates are constrained to be in the range . We note that the vertical axis is positive in the downward direction. The value of temperature inside the rectangular region is a constant constrained to be in the range of units. Initial temperature fields sampled from this distribution are included in the sample set which is used to train the GAN. Four randomly selected samples from this set, which contains 50,000 images, are shown Figure 3.
We train a Wasserstein GAN (WGAN) with gradient penalty term on the set to create a generator to produce synthetic images of the initial temperature field. The detailed architecture of the generator () and the discriminator() are shown in Appendix A. The generator consists of 3 residual blocks (see He2015a) and 4 convolutional layers and the discriminator consists of 3 residual blocks and 8 convolutional layers. Both the generator and the discriminator were trained using Adam optimizer with equal learning rate of 1e-4 and momentum parameters = 0.9 and
= 0.5. The entire training and inference was performed using Tensorflowtensorflow2015-whitepaper on a workstation equipped with dual Nvidia GeForce RTX 2080Ti GPUs.
The latent vector space of the GAN comprises of 8 iid variables conforming to a Gaussian distribution with zero mean and unit variance. The training of the GAN proceeds in the standard adversarial manner (seegulrajani2017improved), and the samples generated by it become more realistic as the training progresses. Some representative images produced by the fully-trained GAN are shown in Figure 4. From these images we conclude that the GAN is able to replicate the key characteristics of the true data remarkably well. However, we also observe some slight deviations, which could likely be addressed with more training and/or capacity.
Next we generate the target field that we wish to infer and the corresponding measurement. As shown in Figure 5a, this comprises of a square patch with edge = centered on center of the total domain. This field is passed through the forward map to generate the noise-free version of the measured field, which is shown in Figure 5b. Thereafter, iid Gaussian noise with zero mean and unit variance is added to this field to generate the synthetic measured field (shown in Figure 5c.).
Once the generator of the GAN is trained and the measured field has been computed, we apply the algorithms developed in the previous section to probe the posterior distribution.
We first use these to determine the MAP estimate for the posterior distribution of the latent vector (denoted by ). In order to evaluate this estimate we use a gradient-based algorithm (BFGS) to solve (12). We use 32 distinct initial guesses for , and drive these to values where the gradient, as defined in (13), is small. Of these we select the one with the smallest value of as an approximation to the MAP estimate. The value of is shown in Figure 6b. By comparing this with the true value of the inferred field, shown in Figure 6a, we observe that the MAP estimate is very close to the true value. This agreement is even more remarkable if we recognize that the ratio of noise to signal is around , and also compare the MAP estimates obtained using an or an prior (see Figures 6c and d) with the true value. We note that these estimates are very different from the true value, and in both cases the edges and the corners of the initial field are completely lost. In contrast to this, the MAP estimate from the GAN prior is able to retain these features. This is primarily because these characteristics of the spatial distribution of the initial temperature field, that is a rectangular patch with homogeneous temperature distribution, are embedded in the prior.
Next, we consider the results obtained by sampling from the MCMC approximation to the posterior distribution of . The MCMC approximation was obtained by applying the random walk Metropolis Hastings Metropolis1953; Hastings1970 algorithm to the distribution defined in (7). The proposal density was Gaussian with zero mean and standard deviation equal to 0.005.
The MCMC approximation to the mean of the inferred field computed using (10) is shown in Figure 6f. We observe that the edges and the corners of the temperature field are smeared out. This indicates the uncertainty in recovering the values of the initial field along these locations, which can be attributed to the smoothing nature of the forward operator especially for the higher modes.
The MAP estimate, , was obtained by selecting the sample with largest posterior density among the MCMC samples, and is shown in Figure 6e. We observe that both in its spatial distribution and quantitative value, it is close to the true distribution. Further, by comparing Figure 6e and 6f, we note that while and share some common features, they are not identical. This is to be expected because, in general .
A more precise estimate of the uncertainty in the inferred field can be gleaned by computing the variance of the inferred initial temperature at each spatial location. In Figure 7 we have plotted the point-wise standard deviation (square-root of the diagonal of co-variance) of the inferred field. We observe that the standard deviation is the largest at edges and the corners, where the forward operator has smoothed out the initial data, and thus introduced large levels of uncertainty in the precise location of these features.
In this manuscript we have considered the use of the distribution learned by a GAN as a prior in a Bayesian update. We have demonstrated that with sufficient capacity and training of the GAN, the corresponding posterior tends to be the posterior obtained using the true prior density.
We believe that this approach addresses two challenges that are often encountered in Bayesian inference. First, it facilitates the application of Bayesian inference to cases where the prior is known only through a collection of samples and is difficult to represent through hand-crafted operators. Second, since a typical GAN generates complex distributions for vectors of large dimension by sampling over a much smaller space of latent vectors, it provides an efficient means to sample the posterior distribution.
In order to make these ideas practical we have proposed two strategies to estimate population parameters for the proposed posterior. The first is a simple Monte-Carlo approximation to the corresponding integrals where the samples are chosen using the prior. The second involves a Markov-Chain Monte-Carlo (MCMC) approximation of the posterior distribution in order to generate the appropriate samples. Further, under the assumptions of Gaussian noise and Gaussian latent vector components, we have described a simple gradient-based approach to recover the maximum a-posteriori (MAP) estimate of the posterior distribution.
We have demonstrated the utility of these methods on the simple inverse problem of determining the initial temperature field distribution from a noisy measurement of the temperature field at a later time. In this initial test, we have found that the proposed method works well and holds promise for solving more complex problems.
The work described in this manuscript can be extended along several different avenues. These include (a) applying it to more complex and challenging inverse problems; some with strong nonlinearities, (b) examining the relation between the dimension of the latent vector space and the accuracy of the posterior distribution, (c) the use of other generative machine learning algorithms, such as variation auto-encoders, as priors, and (d) the application of advanced MCMC techniques like a Metropolis-adjusted Langevin algorithm (MALA) Atchade2006 and Hamiltonian Monte-Carlo methods (HMC) Hoffman2014; Brooks2012 for accurately and efficiently sampling the posterior distribution.
Appendix A Architecture details
The architecture of the generator component of the GAN is shown in Figure 8, and the architecture of the discriminator is shown in Figure 9. Some notes regarding the nomenclature used in these figures: