Bayesian Reasoning with Deep-Learned Knowledge

01/29/2020 · by Jakob Knollmüller, et al. · Max Planck Society

We access the internalized understanding of trained, deep neural networks to perform Bayesian reasoning on complex tasks. Independently trained networks are arranged to jointly answer questions outside their original scope, formulated in terms of a Bayesian inference problem. We solve this approximately with variational inference, which provides uncertainty on the outcomes. We demonstrate how the following tasks can be approached this way: combining independently trained networks to sample from a conditional generator, solving riddles involving multiple constraints simultaneously, and combining deep-learned knowledge with conventional noisy measurements in the context of high-resolution images of human faces.


1 Introduction

Many real-world systems are incomprehensibly complex, evading an explicit mathematical description. One way to approach such complexity is with deep neural networks. They replace explicit formulations with a flexible architecture and vast amounts of examples, discovering the richness of the system on their own. During training, the networks internalize the relevant structure within their weights to perform their envisioned task. Here we bring together two widely used kinds of architectures. The first are deep feed-forward neural networks used for classification or regression. Once trained, they are a tool to check whether some arbitrary, possibly abstract, feature is present within the system. The second are deep generative neural networks, which learn to create realistic system samples, implicitly requiring a deep understanding of the system. Possible ways to obtain them are Variational Auto-Encoders (VAEs) (Kingma & Welling, 2013) or Generative Adversarial Nets (GANs) (Goodfellow et al., 2014). Here we want to perform Bayesian reasoning in complex systems represented by generative models, subject to abstract constraints implemented through independently trained networks.

This question is also approached by conditional GANs (Mirza & Osindero, 2014). Here a class label is provided during training, allowing conditional sampling afterwards. The posed questions have to be known beforehand and enough training data has to be available. Changing the class type afterwards requires re-training the entire network. Using multiple constraints simultaneously can also be an issue due to the involved combinatorics, which renders even large amounts of data small for each combination. With a Bayesian approach we compose unconstrained generators with independently trained neural networks that check for a certain trait to build Bayesian conditional generators. These allow questions to be asked as they arise. This way it is possible to set up non-trivial problems that require reasoning by imposing multiple constraints simultaneously. In our examples we demonstrate such Bayesian conditional generators and how to solve a non-trivial riddle by combining the appropriate constraints.

Deep generative models have been used to describe complex systems in classical (Dong et al., 2015; Jin et al., 2017; Ulyanov et al., 2018) and Bayesian settings (Wu et al., 2018; Böhm et al., 2019), and they perform outstandingly on their tasks. In the cited works, conventional measurement data was used to constrain the system. In a Bayesian setting, the generative models can be interpreted as representing prior distributions. Such deep generative priors are especially powerful for expressing nonlinear correlations in the uncertainty quantification. We want to demonstrate the compatibility of the approach presented here with such setups. We include knowledge on age and gender to improve a high-resolution reconstruction of a severely degraded picture of a human face. For this we combine commonly used networks for their respective tasks, trained on different datasets.

In all demonstrations we follow a common procedure. We implement the appropriate prior and likelihood distributions, containing the trained networks, to set up a Bayesian inference problem, which we solve approximately via variational inference. The result is a set of samples from the approximate posterior distribution, which represents the solution to the posed problem and also expresses its uncertainty.

2 Bayesian Reasoning

Bayesian reasoning answers the fundamental question of how the knowledge on a system adapts in the light of new information. The prior knowledge is stored within the prior distribution P(s), containing all uncertainties, correlations, and features that define the system s. Any kind of new information is contained in a likelihood distribution P(d|s) that relates a system state s to the observed quantity d. The answer is then given by the posterior distribution P(s|d), provided by Bayes' theorem

P(s|d) = P(d|s) P(s) / P(d) .    (1)

This equation applies to any kind of information and it provides the generalization of Aristotelian logic into the realm of uncertainty (Cox, 1946). The prior distribution allows us to implement a detailed description of the system by using deep and complex hierarchical Bayesian models. In the same way, if the relation of the data to the model state is understood, the likelihood will constrain the posterior distribution.
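
To make Eq. 1 concrete, the following minimal Python sketch updates a prior over two discrete system states with a single observation; the numbers are purely illustrative and not taken from the paper:

    import numpy as np

    # Hypothetical two-state example: prior belief over system states s in {0, 1}
    prior = np.array([0.7, 0.3])          # P(s)
    likelihood = np.array([0.1, 0.8])     # P(d | s) for the observed datum d

    # Bayes' theorem: posterior proportional to likelihood times prior
    unnormalized = likelihood * prior
    posterior = unnormalized / unnormalized.sum()   # divide by the evidence P(d)
    print(posterior)                                # ~[0.226, 0.774]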

3 Deep-Learned Knowledge

Deep neural networks can capture extremely complex concepts and features by learning from enormous amounts of data. They have revolutionized the way classification and regression problems are approached. For a given input, a trained network deduces an estimate for the class in the case of classification, or for a continuous quantity in the case of regression. Another increasingly popular application of deep neural networks is generative models, which produce samples according to a training set. Both applications require an internalized understanding of the system at hand. This knowledge is stored within the weights of the trained networks. We want to access this knowledge to solve a wide range of different problems without having to re-train the networks. The key to this is the translation of the neural networks into probability distributions. After this we can use Bayes' theorem to combine their information and obtain posterior distributions, answering the posed questions.

3.1 Deep-Learned Priors

Generative models, such as GANs and VAEs, represent a probability distribution over some quantity. They generate samples following the distribution of the training data. The sample generation shares a common procedure. First, a random sample ξ is drawn from some simple distribution, e.g. a standard Gaussian N(ξ|0, 1). This latent sample is then nonlinearly processed by the trained network to resemble a training realization s = G(ξ).

The latent parameters ξ are the effective parametrization of the system, which a priori follow a simple distribution. The entire complexity of the model is stored in the nonlinear transformation within the generator G. From the perspective of probabilities this is a standardized model parametrization, corresponding to the reparametrization trick (Kingma & Welling, 2013) applied to the model parameters. This parametrization resolves a number of conceptual and numerical issues in Bayesian inference problems (Betancourt & Girolami, 2015; Kucukelbir et al., 2017). The generative neural network learns to approximate the multivariate distributional transform (Rüschendorf, 2009) of the distribution underlying the training set. Any hierarchical Bayesian model over continuous variables can be represented in this form. The generative neural network represents the system with independently distributed latent variables combined with a complex, non-linear transformation to the system coordinates. Given any additional information d on the system s = G(ξ) in the form of a likelihood P(d|s), we impose the generative prior by relating the likelihood to the latent variables through the network. The Bayesian inference problem in terms of the latent model variables, with the generative prior and standard Gaussian distribution, is

P(ξ|d) = P(d | G(ξ)) N(ξ|0, 1) / P(d) .    (2)

This equation tells us how any kind of new information restricts the latent parameters, whose samples correspond to samples from the constrained system. It is not required that the generator was trained on Gaussian latent samples. If the latent space distribution is known, it can be reparametrized in terms of Gaussian random variables by adding the transformation between the two distributions to the generator.
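
As a minimal sketch of how the latent-space posterior of Eq. 2 can be assembled in code, the following Python function combines a generic generator with a generic likelihood; the callables generator and neg_log_likelihood are placeholders and not part of the paper's implementation:

    import numpy as np

    def neg_log_posterior(xi, generator, neg_log_likelihood):
        """Negative log of Eq. 2 up to the constant evidence term.

        xi                 -- latent vector, a priori standard Gaussian
        generator          -- maps the latent xi to a system realization s = G(xi)
        neg_log_likelihood -- -ln P(d | s) for the fixed observed data d
        """
        s = generator(xi)
        prior_term = 0.5 * np.dot(xi, xi)   # -ln N(xi | 0, 1) up to constants
        return neg_log_likelihood(s) + prior_term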

Figure 1: Representative samples (left to right) from the approximate posterior distribution conditioned on the ten different digits (top to bottom).

3.2 Deep-Learned Constraints

Knowledge on a system is often represented in terms of a list of constraints. Given a system realization s, we can use some function f(s) to check whether these properties are present, and therefore whether the realization is consistent with our knowledge or not. In terms of probabilities, this is a statement on the likelihood of the constraints being fulfilled. If we are absolutely certain about some aspect of the system having the continuous value d, this likelihood corresponds to the delta distribution

P(d|s) = δ(d − f(s)) .    (3)

An efficient way to obtain functions that check whether the system fulfills even complex properties is deep learning. However, a network will only provide probabilistic answers and estimates to the posed questions, not certainty. The delta distribution for hard constraints is therefore relaxed towards numerically more convenient likelihoods. This also allows us to express uncertainty in system features that might only be fulfilled. We can, for example, use a Gaussian likelihood of the classifications or features with mean f(s) and adjust the variance σ² according to our expectations:

P(d|s) = N(d | f(s), σ²) .    (4)

In the limiting case of vanishing variance, we again approach the delta distribution.

Another way to impose classification constraints is to encode our expectations within a multinomial distribution as likelihood:

P(d|s) = N! / (d_1! ⋯ d_k!) ∏_i p_i(s)^(d_i) ,  with  N = ∑_i d_i .    (5)

This distribution describes the outcome of N repetitions of a categorical experiment. The data d is a discrete vector of outcome counts for the different categories, summing up to N. Given a system realization, the neural network provides the classification probabilities p_i(s). The way to express our knowledge of a category is to choose the data d accordingly. For relatively certain features, the corresponding entry has to be large and all others small. In the limit of absolute certainty, N approaches infinity and we recover the delta distribution. This likelihood also allows us to express high certainty that it is indistinguishable whether the object belongs to one class or another by having large entries for both categories with the according odds. We can express uncertainty in a category by choosing a small N and correspondingly low counts, allowing us to encode preferences for certain outcomes without enforcing them.
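
The following sketch evaluates the multinomial constraint of Eq. 5 up to parameter-independent terms and shows a few hypothetical data vectors d for a ten-class digit classifier; the helper name and the pseudo-count values are illustrative assumptions, not the paper's settings:

    import numpy as np

    def neg_log_multinomial(d, p):
        """-ln of Eq. 5 up to terms that do not depend on the system realization.

        d -- vector of pseudo-counts encoding our expectation (sums to N)
        p -- classification probabilities p_i(s) returned by the trained network
        """
        return -np.sum(d * np.log(p + 1e-12))   # small epsilon guards against log(0)

    # Examples of encoding knowledge in d for a 10-class digit classifier:
    certain_three = 100 * np.eye(10)[3]                      # strong preference for class 3
    three_or_eight = 10 * (np.eye(10)[3] + np.eye(10)[8])    # either 3 or 8, equally likely
    weak_three = 2 * np.eye(10)[3]                           # mild preference, easily overruled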

Figure 2: Architecture of the riddle solver.

4 Bayesian Reasoning with Deep-Learned Knowledge

A deep generative model provides prior knowledge on some system, and a deep classification or regression network can check whether a certain abstract property is fulfilled. It is therefore interesting to combine these two to build networks that generate samples via s = G(ξ) that are subject to specific constraints represented by f(s). From an information-theoretic perspective, Bayes' theorem describes how such a combination should be performed:

P(ξ|d) = P(d | f(G(ξ))) N(ξ|0, 1) / P(d) .    (6)

This is a non-conjugate Bayesian inference problem in standard coordinates. Building the composition f(G(ξ)) of both networks directly relates the latent parameters of the generator to the model parameters of the likelihood. This composition is a generative network over the outcomes of f. This only makes sense if both networks are trained on a common system, but it is not necessary that the same training set is used. In constructing such combined network logic, one is also not restricted to using only a single likelihood. In fact, any kind of information on the system can be added this way. Several abstract properties expressed through deep neural networks for classification or regression can be enforced simultaneously.

Posing the inference in this way, any kind of domain knowledge can be added, as well as any kind of further measurement information. This only requires additional likelihood terms.
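
A minimal sketch of how such a combination could look in code, assuming placeholder callables for the generator and for each deep-learned constraint term (none of these names are taken from the paper's implementation):

    import numpy as np

    def neg_log_joint(xi, generator, constraint_terms):
        """Negative log of Eq. 6 with several deep-learned constraints.

        constraint_terms -- list of callables, each returning -ln P(d_k | s) for
                            one classifier/regressor and its encoded expectation d_k
        """
        s = generator(xi)
        data_terms = sum(term(s) for term in constraint_terms)
        return data_terms + 0.5 * np.dot(xi, xi)

    # Hypothetical usage: an 'is odd' classifier and a 'contains a circle' classifier
    # terms = [lambda s: neg_log_multinomial(d_odd,    odd_classifier(s)),
    #          lambda s: neg_log_multinomial(d_circle, circle_classifier(s))]
    # energy = neg_log_joint(xi, generator, terms)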

Figure 3: Representative samples from the approximate posterior distribution for the first (top row), second (middle row), and third (bottom row) digit of the posed riddle. The top panel shows the result for a reasonably chosen constraint strength, whereas the bottom panel illustrates the behavior for too strongly imposed constraints.

4.1 Approximate Inference

Due to the highly nonlinear structure of G and f, the evidence P(d) will not be available analytically, so we have to rely on approximations to the posterior distribution. Several approaches are available, but the associated approximation problem will be hard for two reasons. The first is the dimensionality of the posterior distribution. We perform the Bayesian inference in terms of the latent parameters of the generator, so the dimension of the posterior is equivalent to the dimension of the latent space, easily comprising hundreds or thousands of parameters. The second is the complexity of the posterior distribution, which is shaped by the hardly comprehensible structure imposed by the trained weights.

The simplest approach is a maximum a posteriori (MAP) estimate, corresponding to the location of highest posterior probability. It can cope with high-dimensional posterior distributions, but it tends to perform poorly in strongly non-linear problems and is sensitive to any multi-modal feature of the posterior distribution.

Methods based on MCMC sampling allow the entire posterior distribution to be explored, given enough computational resources. In high dimensions and for non-linear problems, however, sampling can be extremely inefficient. We argue that such accuracy will not be required most of the time, as deep-learned models are themselves only approximations to reality and exhibit unknown biases.

An efficient compromise is variational inference. Here we approximate the true posterior distribution P(ξ|d) with another distribution Q(ξ|η) within a parametrized family. The Kullback-Leibler divergence between the two distributions is minimized with respect to the variational parameters η:

KL(Q, P) = ∫ dξ Q(ξ|η) ln [ Q(ξ|η) / P(ξ|d) ] .    (7)

Capturing complex features of the posterior distribution in high dimensions requires a prohibitively large number of parameters, making approaches such as normalizing flows (Rezende & Mohamed, 2015) or Gaussian variational inference (Opper & Archambeau, 2009) with a fully parametrized covariance impractical. One way to deal with this is to use a mean-field model, which imposes independence between all posterior parameters and therefore scales linearly with the dimension. However, we do not want to ignore the posterior correlations between the latent parameters.

A method that goes beyond the mean-field approximation and still scales only linearly in computational time and memory is Metric Gaussian Variational Inference (MGVI) (Knollmüller & Enßlin, 2019). It is an iterative method that performs a series of Gaussian variational approximations to find a parameter estimate that is consistent with the covariance estimate. Instead of parametrizing the covariance explicitly, an approximation based on an implicitly stored inverse Fisher metric is used, which contains cross-correlations between all posterior parameters. This approximation provides samples from the approximate posterior, which can be used to estimate the mean and uncertainty of the latent variables ξ, but it ignores higher-order statistics and is especially ignorant towards multi-modality. The result should be seen as one answer, not necessarily the full answer.

We argue that the posterior might not even be too dissimilar to a Gaussian distribution in many cases. A priori the latent parameters follow a Gaussian distribution, so any latent parameter that is not associated with the constraint imposed by the condition will also be Gaussian in the posterior distribution. The generative model maps smoothly between latent parameters and system samples, so similar systems will have similar latent parameters. Useful constraints on the system should affect relevant features, which will be localized in the latent space, occupying a certain volume due to the smoothness. Wu et al. (2018) made similar observations. The multi-modality should therefore not be too severe in many cases, and a Gaussian approximation should be sufficient, providing mean and uncertainty estimates of the posterior latent parameters.
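
For orientation, the following simplified sketch performs stochastic variational inference with a diagonal (mean-field) Gaussian in TensorFlow. It is a stand-in that only illustrates the structure of the approximation loop of Eq. 7; it is not MGVI itself, which is provided by NIFTy and additionally captures cross-correlations via the Fisher metric:

    import tensorflow as tf

    def fit_diagonal_gaussian(neg_log_joint, dim, steps=500, n_samples=5, lr=1e-2):
        """Stochastic variational inference with a diagonal Gaussian Q(xi | mu, sigma).

        Simplified stand-in for MGVI: minimizes the KL divergence of Eq. 7 by
        gradient descent, assuming `neg_log_joint` maps a batch of latent vectors
        of shape (n_samples, dim) to per-sample values of -ln P(d, xi).
        """
        mu = tf.Variable(tf.zeros(dim))
        log_sigma = tf.Variable(tf.zeros(dim))
        optimizer = tf.keras.optimizers.Adam(lr)
        for _ in range(steps):
            with tf.GradientTape() as tape:
                eps = tf.random.normal((n_samples, dim))
                xi = mu + tf.exp(log_sigma) * eps              # reparametrization trick
                expected_energy = tf.reduce_mean(neg_log_joint(xi))
                entropy = tf.reduce_sum(log_sigma)             # Gaussian entropy up to constants
                loss = expected_energy - entropy               # KL(Q, P) up to constants
            grads = tape.gradient(loss, [mu, log_sigma])
            optimizer.apply_gradients(zip(grads, [mu, log_sigma]))
        return mu, tf.exp(log_sigma)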

Figure 4: Architecture of the face reconstruction.

5 Demonstrations

We want to give three distinct examples of varying size and complexity of Bayesian reasoning in which we combine deep neural networks trained on classification and regression tasks with deep generative models and other kinds of information, using MGVI to sample from an approximate posterior. The first example is a Bayesian conditional generator for hand-written digits based on the MNIST dataset (LeCun et al., 1998).

The second example extends the first by posing a riddle of finding a set of three hand-written numbers fulfilling multiple constraints simultaneously. On the mathematical level this will be trivial, but here we solve the problem in terms of the latent variables of the generator, which, a priori, have absolutely nothing to do with the abstract mathematical concept of numbers. This example provides a recipe to approach Bayesian reasoning in complex systems subject to complex constraints.

The last example reconstructs a high-resolution image of a face from severely degraded data, while including knowledge on age and gender to improve the image recovery. Here we have an extremely high-dimensional posterior distribution combined with generators and classifiers comprising tens of millions of trained weights. This example demonstrates the compatibility of classical measurement information with abstract knowledge provided by the deep-learned system. The deep generator and classification networks are trained on distinct training sets. This example shows that the approach is not limited to low-dimensional toy examples and can be used in settings relevant to real-life applications.

All experiments have been implemented in Python using the NIFTy5 (Arras et al., 2019) and TensorFlow (Abadi et al., 2015) software packages. All optimization steps were performed on an Intel i7 CPU and the network evaluations on an Nvidia GeForce 1080ti. The code has not been highly optimized for this problem and the stated times should serve only as a rough orientation. The starting position is always the latent mean perturbed by Gaussian noise with a small standard deviation.

Figure 5: Setup and results for the face reconstruction problem with ground truth (top left), masked and corrupted data (top right), the mean reconstructed image (bottom left), and the pixel-wise standard deviation (bottom right).

5.1 Bayesian Conditional Generators

Deep conditional generative networks allow samples to be drawn according to certain specifications. During the training phase, a class label or feature is also shown to the network, and once it is trained, samples conditional on this feature can be drawn. However, one is restricted to the features available in the training set. In terms of probabilities, conditional generative models generate samples s given a feature d, i.e. they represent the posterior distribution P(s|d). Here we approach the same problem by expressing the constraint within a likelihood, implemented through a trained classification network, and using the generator as prior distribution. The posterior is then approximated in the latent variables.

The negative logarithm of the joint probability for this problem, up to parameter-independent constants (indicated by ≃), is given by

−ln P(d, ξ) = −ln P(d | p(G(ξ))) − ln N(ξ|0, 1)    (8)
            ≃ −∑_i d_i ln p_i(G(ξ)) + ½ ξ†ξ .    (9)

Here the first term originates from the negative logarithm of the multinomial distribution of Eq. 5, with the data vector d carrying the number of outcomes for each class. These are used to encode the class expectations (via relative outcome numbers) as well as the certainty (via the total number of outcomes). The second term comes from the Gaussian prior distribution over the latent variables.

We will start simple by discussing hand-written digits in the MNIST data set conditioned on a certain label. As generative model we use a Wasserstein-GAN (Arjovsky et al., 2017; Gulrajani et al., 2017) with three hidden layers, a convolutional architecture, and a moderate number of latent variables. The digit classification is performed by a deep three-layer convolutional neural network (Krizhevsky et al., 2012), trained on a cross-entropy loss and achieving high test accuracy. As likelihood we use the multinomial distribution with all N draws assigned to the demanded category i, i.e. d_i = N and d_j = 0 for all j ≠ i, expressing certainty in digit i. We condition the generator to sample according to the ten class labels and approximate the posterior distribution using MGVI. For this we perform three consecutive approximations, using five pairs of antithetic samples. For each approximation we performed a number of natural gradient steps. For each label this procedure requires a few minutes. The results in the form of approximate posterior samples are shown in Fig. 1.
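
Schematically, conditioning on a digit label amounts to choosing the multinomial data vector and handing the resulting energy to the variational scheme. The snippet below reuses the helper functions sketched earlier and treats generator and digit_classifier as placeholder callables; the count N shown here is an illustrative choice, not the value used in the paper:

    import numpy as np

    # Hypothetical setup for conditioning the digit generator on the label k.
    N = 100                    # total pseudo-counts, i.e. constraint strength (illustrative)
    k = 7                      # demanded digit
    d = N * np.eye(10)[k]      # d_k = N, all other entries zero

    def energy(xi):
        # negative log joint of Eqs. 8-9, built from the placeholder networks
        return neg_log_joint(
            xi, generator,
            [lambda s: neg_log_multinomial(d, digit_classifier(s))])

    # `energy` is then handed to the variational approximation (MGVI in the paper).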

Overall the samples are accurate in most cases and represent the class-internal variability well. The fidelity of the samples is impacted by three different sources: first, how well the generator internalized the system properties; second, how well the classification network understands the concept of digits; and third, the fidelity of the MGVI approximation to the posterior distribution.

Figure 6: Face samples informed by data, age, and gender.

5.2 Solving Riddles the Bayesian Way

In this second example we want to answer a more elaborate question, but we stay in the context of hand-written digits. We are looking for three single-digit numbers that fulfill the four properties outlined in Tab. 1.

There is only one possible solution that fulfills all of these constraints: 3, 5, and 8. Here we want to solve this riddle on the level of the latent variables of hand-written digits by setting up and solving the corresponding Bayesian inference problem. The recipe to do this is straightforward. We have to implement all of these constraints in terms of an appropriate likelihood function and combine it with the suitable prior. As prior distributions for the three digits we use three instances of the same generative model as in the previous example, resulting in three sets of latent variables. To implement the constraints we use three independently trained convolutional neural networks. We use the same label classification network as in the previous example. We implement the last two constraints via two additional networks that check whether a number is odd or contains a closed circle. Here we use the identical convolutional architecture as for the digit classification, but train each remaining network on its respective task. We compose all these classification networks with the generator to translate the latent variables to classification probabilities

p(ξ) = f(G(ξ)) .    (10)

The likelihood corresponding to the first condition requires the classification probabilities of all three digits to calculate how likely it is that they fulfill the equation. Here the mathematical logic of addition is directly implemented into the model, represented by the 10×10×10 tensor T with ones for valid equations and zeros elsewhere:

T_ijk = 1 for i + j = k, and T_ijk = 0 otherwise .    (11)

Contracting this tensor with the three classification probabilities provides the overall probability of whether the equation is true or not. This way we include explicit domain knowledge in the reasoning system. Similarly, we implement the second constraint using a 10×10 tensor T' with T'_ij = 1 for j = i + 2 and zero otherwise. Here the probability of satisfying the condition is the sum of the first digit's probabilities multiplied with the second digit's probabilities that are larger by two. The third and fourth constraints are the plain networks trained on the respective tasks, attached to the corresponding digits.
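
A small sketch of how the two arithmetic constraints could be encoded as tensors and contracted with the classification probabilities; the variable names and the einsum formulation are our own, not the paper's code:

    import numpy as np

    # Rank-3 tensor encoding "first plus second equals third" over single digits
    T_add = np.zeros((10, 10, 10))
    for i in range(10):
        for j in range(10):
            if i + j <= 9:
                T_add[i, j, i + j] = 1.0

    # Matrix encoding "second digit is larger than the first by two"
    T_plus2 = np.zeros((10, 10))
    for i in range(8):
        T_plus2[i, i + 2] = 1.0

    def prob_constraints(p1, p2, p3):
        """Probabilities that the first two riddle constraints hold, given the
        classification probabilities p1, p2, p3 of the three generated digits."""
        p_sum = np.einsum('i,j,k,ijk->', p1, p2, p3, T_add)
        p_two_larger = np.einsum('i,j,ij->', p1, p2, T_plus2)
        return p_sum, p_two_larger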

1. The first plus the second number equal the third.
2. The second number is larger than the first by two.
3. The first number is odd.
4. The third number contains a closed circle.
Table 1: The riddle discussed in Sec. 5.2.

This gives us a way to obtain, for each constraint, the probability of whether it is fulfilled or not. We use the multinomial likelihood for all statements, where we encode our knowledge that all constraints must be fulfilled within the data vector d. We assign each of them a moderate number of counts for the statement being true and none for everything else, not fully enforcing the conditions but imposing them at a finite level of confidence. As we will see, this is necessary to obtain the right answers.

All these likelihoods are multiplied together with the prior probabilities. Up to parameter-independent terms, the negative logarithm of the joint probability is

−ln P(d, ξ) ≃ −∑_i d_i^(1) ln p_i^(1)(ξ) − ∑_i d_i^(2) ln p_i^(2)(ξ) − ∑_i d_i^(3) ln p_i^(3)(ξ) − ∑_i d_i^(4) ln p_i^(4)(ξ)
              + ½ (ξ_1†ξ_1 + ξ_2†ξ_2 + ξ_3†ξ_3) ,    (12)

where d^(c) and p^(c)(ξ) denote the data vector and fulfillment probabilities of condition c, and ξ_1, ξ_2, ξ_3 are the latent variables of the three digits. The first four terms correspond to conditions 1-4 and the last line is due to the standard Gaussian priors on the latent variables of the digits. A representation of the graphical structure of this problem is given in Fig. 2.

This poses a Bayesian inference problem for the latent variables of the three digits. We perform approximate inference using MGVI with a series of consecutive approximations, again using five pairs of antithetic samples. For each approximation we performed a limited number of natural gradient steps to avoid over-fitting on the sample realizations. Solving this task requires on the order of tens of minutes. The results in the form of approximate posterior samples are shown in the top of Fig. 3.

The result is the three generators subject to the posed constraints, providing a consistent set of hand-written digits that corresponds to the answer to the riddle. Most of the samples are correct, but not all. Such errors are vital for finding the correct answer, and they are due to the relatively weak enforcement of the constraints. It is instructive to study such inference mistakes. To provoke them, we use the same setup but strongly enforce the constraints, so that the networks settle down in the first local minimum of Eq. 12 they encounter. A typical behavior is shown in the bottom of Fig. 3, where we increased the confidence in the respective outcomes by a factor of ten. To a human this is not a valid solution, but according to the classifications, all constraints are satisfied. What happened here is that the latent space of the generators could not be sufficiently explored due to the local nature of the approximation and the too rigid enforcement of the constraints. The networks solved the problem creatively by generating an ambiguous first digit, which the label classifier reads as one particular digit, resulting in the corresponding second digit; the sum of both then contains a full circle. In the eyes of the network for odd numbers, however, the same first digit might just as well be an odd digit, also fulfilling the third constraint. The structure of the first digit therefore does not allow for much variance, as any deviation towards a more distinguished appearance would violate the delicate balance between the two networks. Weaker constraints allow the networks to think more outside the box, i.e. outside the local minimum. Adapting the constraint strength during the approximation could allow for improved results.

Figure 7: Face samples informed by the data only.

5.3 Bayesian Faces

The last example reconstructs the image of a face from degraded, noisy, and incomplete data, making use of the additional information of age and gender. For comparison we also condition the generator on the data alone and on age and gender alone. As generative model of face images, a StyleGAN network trained on the Flickr-Faces-HQ data set is used (Karras et al., 2019). It generates photo-realistic, high-resolution images of faces from its latent parameters and contains millions of trained weights. To obtain age and gender estimates we rely on two different networks with the same ResNet architecture (He et al., 2016), trained on the IMDB-WIKI dataset (Rothe et al., 2015, 2016). Each of them also contains millions of trained parameters. The ground truth is drawn from the generative model and degraded in several ways to obtain the data. First, the three color channels are added up to generate a gray-scale image. In a second step the resolution is reduced via coarse-graining, followed by masking the left part of the image. These steps are summarized in the linear degradation operator R. Finally, Gaussian white noise with unit variance is added, providing the image data d. We obtain age and gender estimates of the ground truth by applying the corresponding classifiers. To account for different input and output shapes, a re-scaling between the two image resolutions is included, linking both types of networks.
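
The described degradation could be sketched as follows; the coarse-graining factor, the averaging, and the zero-masking of the left half are illustrative implementation choices, since the exact values and conventions are not reproduced here:

    import numpy as np

    def degrade(image, factor=16, rng=np.random.default_rng(0)):
        """Sketch of the linear degradation R followed by additive noise.

        image  -- array of shape (H, W, 3) with H, W divisible by `factor`
        factor -- coarse-graining factor (illustrative, not the paper's value)
        """
        gray = image.sum(axis=-1)                               # add up the color channels
        h, w = gray.shape
        coarse = gray.reshape(h // factor, factor,
                              w // factor, factor).mean(axis=(1, 3))
        coarse[:, : coarse.shape[1] // 2] = 0.0                 # mask the left part
        return coarse + rng.normal(size=coarse.shape)           # unit-variance white noise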

The likelihood for the full problem consists of three terms. All of them require the respective operator or network applied to the generator, as in Eq. 10. The data from the degraded image enters via a Gaussian likelihood containing the degradation operator applied to the generator. For the age prediction we calculate the weighted average of the ages provided by the classification probabilities, resulting in a continuous estimate f_age. This also enters via a Gaussian likelihood, centered around the age a of the ground truth and assuming a standard deviation of one year. The gender is enforced via a multinomial likelihood with ten data points in favor of the category, corresponding to a high, but not absolute, certainty. The negative logarithm of the joint distribution of this problem, up to parameter-independent constants, is

−ln P(d, a, g, ξ) ≃ ½ (d − R G(ξ))† (d − R G(ξ)) + ½ (a − f_age(G(ξ)))² − ∑_i g_i ln p_i^gender(G(ξ)) + ½ ξ†ξ ,    (13)

where g denotes the multinomial data vector encoding the gender information.
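
Put together, the negative log joint of Eq. 13 could be sketched as below, with all networks and the degradation operator passed in as placeholder callables; the symbols mirror the reconstruction above rather than the paper's actual code:

    import numpy as np

    def neg_log_joint_face(xi, generator, degrade_op, age_net, gender_net,
                           data, age_target, gender_counts, sigma_age=1.0):
        """Sketch of Eq. 13 up to constants; all network callables are placeholders.

        degrade_op    -- the linear degradation R (without the noise)
        age_net       -- returns age-class probabilities; their weighted average
                         serves as a continuous age estimate
        gender_counts -- multinomial pseudo-counts encoding the gender information
        """
        s = generator(xi)
        residual = data - degrade_op(s)
        image_term = 0.5 * np.sum(residual ** 2)                 # unit-variance Gaussian noise
        age_probs = age_net(s)
        age_estimate = np.sum(np.arange(age_probs.size) * age_probs)
        age_term = 0.5 * (age_target - age_estimate) ** 2 / sigma_age ** 2
        gender_term = -np.sum(gender_counts * np.log(gender_net(s) + 1e-12))
        prior_term = 0.5 * np.dot(xi, xi)
        return image_term + age_term + gender_term + prior_term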

The corresponding graphical structure is shown in Fig. 4. According to the networks, the ground truth is female; the corresponding age estimate serves as the reference value in the likelihood. The posterior distribution is approximated using MGVI. We perform ten consecutive approximations with five pairs of antithetic samples and a number of natural gradient steps each. In the last approximation we increase the number of samples. Due to the large amount of required network evaluations, the overall run-time for the problem is roughly 20 hours.

The setup and the posterior mean result with its variance are shown in Fig. 5. A collection of representative samples from the approximation is shown in Fig. 6. Some aspects of the ground truth are recovered well. The mean of the sample images bears striking similarity to the ground truth, including the facial expression, the overall position, and the outdoor setting in the background. The pixel-wise standard deviation shows high certainty in the central parts of the face, whereas smaller-scale features such as hairstyle and background details are washed out. The exact extent of the face in the left half also seems to be more uncertain, as this part is masked. The samples are relatively homogeneous in style, age, and gender, although some might deviate. Estimating the age from a single image is a hard task and the networks are far from perfect, but there is no severe outlier. These samples express highly nonlinear correlations in image space, allowing complex uncertainties to be quantified. We recover the underlying truth remarkably well, but note that the original image was generated by the same network used for the reconstruction, so the ground truth is known to the generator.

For the reconstruction using only the image data, the samples are shown in Fig. 7. These illustrate what kind of information is still present in the degraded data. It seems clear that the image shows an outdoor setting and that the person smiles and is most likely female. The data does not seem to constrain the age that well, as the visual spread is far larger than in the previous samples, including a number of children. This is plausible, as age is mostly associated with small-scale features, which are removed by the degradation. Including the additional information on age and gender reduces the variance in these directions and improves the overall reconstruction.

We can also condition the generator only on age and gender. The samples are shown in Fig. 8. All samples show a large variance in setting and style, but they roughly appear to have the same age, and all except the last sample seem to be female. We only implemented the gender constraint with high, but not absolute, certainty, so we are not surprised to find one outlier; it is even expected. In this example we increased the perturbation of the initial position, as the immediate vicinity of the prior mean is inhabited by highly untypical image configurations, which are problematic for the networks. This was enough to escape the implausible region and obtain the shown results.

Figure 8: Face samples informed by age and gender only.

6 Conclusion

We demonstrated how to impose complex constraints represented by deep neural networks on complex systems represented through deep generative models. This allows already trained networks to be combined to perform tasks outside their initial scope. New questions formulated in terms of probabilities can be answered via Bayesian reasoning. The answers are also given in terms of probability distributions, so uncertainty is inherently propagated. With this we built Bayesian conditional generators, removing the necessity of knowing the questions before training. Combining several constraints, we can build a collection of neural networks that jointly perform reasoning to solve a non-trivial task. The approach is applicable to extremely deep, state-of-the-art architectures and feasible even for high-dimensional posterior distributions. Knowledge of high-level concepts can be included to support reconstructions from conventional measurement data. We hope that this work enables the development of exciting new applications, combining the strengths of deep learning and Bayesian reasoning.

References

  • Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • Arras et al. (2019) Arras, P., Baltac, M., Ensslin, T. A., Frank, P., Hutschenreuter, S., Knollmueller, J., Leike, R., Newrzella, M.-N., Platz, L., Reinecke, M., et al. Nifty5: Numerical information field theory v5. Astrophysics Source Code Library, 2019.
  • Betancourt & Girolami (2015) Betancourt, M. and Girolami, M. Hamiltonian monte carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79:30, 2015.
  • Böhm et al. (2019) Böhm, V., Lanusse, F., and Seljak, U. Uncertainty quantification with generative models. arXiv preprint arXiv:1910.10046, 2019.
  • Cox (1946) Cox, R. T. Probability, frequency and reasonable expectation. American journal of physics, 14(1):1–13, 1946.
  • Dong et al. (2015) Dong, C., Loy, C. C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777, 2017.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Jin et al. (2017) Jin, K. H., McCann, M. T., Froustey, E., and Unser, M. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
  • Karras et al. (2019) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Knollmüller & Enßlin (2019) Knollmüller, J. and Enßlin, T. A. Metric gaussian variational inference. arXiv preprint arXiv:1901.11033, 2019.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • Kucukelbir et al. (2017) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Opper & Archambeau (2009) Opper, M. and Archambeau, C. The variational gaussian approximation revisited. Neural computation, 21(3):786–792, 2009.
  • Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
  • Rothe et al. (2015) Rothe, R., Timofte, R., and Gool, L. V. Dex: Deep expectation of apparent age from a single image. In IEEE International Conference on Computer Vision Workshops (ICCVW), December 2015.
  • Rothe et al. (2016) Rothe, R., Timofte, R., and Gool, L. V. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV), July 2016.
  • Rüschendorf (2009) Rüschendorf, L. On the distributional transform, sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139(11):3921–3927, 2009.
  • Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.
  • Wu et al. (2018) Wu, G., Domke, J., and Sanner, S. Conditional inference in pre-trained variational autoencoders via cross-coding. arXiv preprint arXiv:1805.07785, 2018.