Assessing Deep Neural Networks as Probability Estimators

11/16/2021
by   Kwo-Sen Kuo, et al.

Deep Neural Networks (DNNs) have performed admirably in classification tasks. However, the characterization of their classification uncertainties, required for certain applications, has been lacking. In this work, we investigate the issue by assessing DNNs' ability to estimate conditional probabilities and propose a framework for systematic uncertainty characterization. Denoting the input sample as x and the category as y, the classification task of assigning a category y to a given input x can be reduced to the task of estimating the conditional probabilities p(y|x), as approximated by the DNN at its last layer using the softmax function. Since softmax yields a vector whose elements all fall in the interval (0, 1) and sum to 1, it suggests a probabilistic interpretation of the DNN's outcome. Using synthetic and real-world datasets, we examine the impact of various factors, e.g., the probability density f(x) and inter-categorical sparsity, on the precision of DNNs' estimates of p(y|x), and find that the likelihood probability density and the inter-categorical sparsity have greater impact on DNNs' classification uncertainty than the prior probability.



I Introduction

The potential of deep neural networks (DNNs) has been amply demonstrated in classification. If we denote the input sample as a vector x and the categorical label as y, a classification task can be viewed as taking the following two steps: 1) it first predicts the conditional probability p(y|x), and then 2) it decides the category an input sample belongs to based on some specific criterion, such as argmax_y p(y|x), i.e., identifying the category with the largest element of p(y|x). Although the final classification result is of primary interest, the intermediate result p(y|x) is necessary for scientific applications in which the characterization of classification uncertainties is desired. However, there is a lack of systematic investigations into this characterization.

A DNN often uses softmax on the output of its last layer. Since softmax yields a vector whose elements all fall in the interval (0, 1) and sum to 1, it suggests a probabilistic interpretation of the DNN's outcome and is used to approximate the probability p(y|x) of the categorical variable y. Typically, during training, the labels of all input samples fed into the DNN are in the one-hot format (i.e., each sample is associated with a single category). The DNN learns p(y|x) implicitly by minimizing the cross-entropy between the output and the one-hot label, without p(y|x) ever being revealed explicitly. The main mechanism for the DNN to capture p(y|x) is through relating local samples of x to the frequencies of y. The (local) sparsity of x in the training dataset, therefore, may limit the capability of the DNN to capture p(y|x).
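To make the probabilistic reading concrete, here is a minimal sketch of the standard softmax computation in Python (the logit values are made up for illustration):

    import numpy as np

    def softmax(logits):
        """Map raw last-layer scores to a vector in (0, 1) that sums to 1."""
        z = logits - np.max(logits)   # shift by the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 1.0, -0.5])   # hypothetical last-layer outputs
    p_hat = softmax(logits)
    print(p_hat, p_hat.sum())             # a probability vector summing to 1.0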

We are interested in assessing the quality of the prediction of p(y|x) and exploring potential factors that may impact the performance metric. However, the lack of ground truth p(y|x) makes it difficult to assess the predictions generated by DNNs. In this paper, we address these challenges with the following main contributions:

  • We propose an innovative generative framework with two paths: one for directly inferring p(y|x) assuming Gaussian probability density functions (pdf's) and one for generating data and training a DNN to approximate p(y|x).

  • We conduct extensive and systematic experiments for both 1-D and high-dimensional inputs to gain insights suggesting that the likelihood probability density and the inter-categorical sparsity influence the performance metric more than the prior probability does.

II Related Work

We describe work related to ours in this section. We note that the sample labeling process naturally biases the distribution because, under the one-hot convention, existing works tend to ignore the samples whose categories annotators are unsure of. We therefore cannot uncritically assume that the distributions of the labeled samples, whether for training or testing, accurately represent those of the real-world population.

II-A Estimating Probability Using DNNs

Substantial work has been conducted on estimating the underlying probability in training data using DNNs. Based on how the actual probability (density) function is approximated, we may divide the existing work into two categories: implicit and explicit estimation.

When a model uses implicit estimation, it does not approximate the distribution in a closed form but usually generates samples subject to the distribution. The Generative Adversarial Network (GAN) [7] consists of a generator and a discriminator, which co-evolve to achieve the best performance. The generator implicitly models the distribution of the training data, and the discriminator attempts to differentiate between the true distribution and the distribution synthesized by the generator. The generator, however, has not been leveraged to generate samples with prescribed distributions for uncertainty characterization. Since its invention, GAN has evolved into a large family of architectures [20, 2, 5, 11, 13, 22, 28, 27, 29].

Explicit estimation attempts to learn the distribution in a closed form. Some pioneering studies [17, 1, 19] discuss the capability of DNNs to approximate probability distributions. The Mixture Density Network (MDN) [3] predicts not only the expected value of a target but also the underlying probability distribution. Given an input, MDN extends maximum likelihood by replacing the single Gaussian pdf with a mixture model. The Probabilistic Neural Network (PNN) [24] uses a Parzen window to estimate the probability density for each category and then uses Bayes' rule to calculate the posterior p(y|x). PNN is non-parametric in the sense that it does not need any learning process; at each inference, it uses all training samples as its weights. These techniques do not seem to consider the possibility that the distributions of the labeled samples may deviate from those of the population.
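To make the PNN mechanism concrete, the following sketch (our own simplification, not the original implementation) estimates the per-category density with a Gaussian Parzen window and then applies Bayes' rule; the bandwidth h is a free parameter we introduce for illustration:

    import numpy as np

    def parzen_posterior(x, X_train, y_train, h=0.5):
        """PNN-style posterior: Parzen density per class + Bayes' rule.

        x: scalar query point; X_train, y_train: 1-D samples and labels;
        h: Gaussian kernel bandwidth (hypothetical default).
        """
        classes = np.unique(y_train)
        scores = []
        for c in classes:
            Xc = X_train[y_train == c]
            # Parzen (kernel density) estimate of f(x|y=c)
            dens = np.mean(np.exp(-0.5 * ((x - Xc) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
            prior = len(Xc) / len(X_train)   # empirical prior p(y=c)
            scores.append(dens * prior)
        scores = np.array(scores)
        return scores / scores.sum()         # normalized posterior p(y|x)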

II-B Approximate Inference

As the inference process for complex models is usually computationally intractable, one cannot perform inference exactly and must resort to approximation. Approximate inference methods fall into two categories: sampling and variational inference.

Sampling is a common method for addressing intractable models. One can draw samples from a model and fit an approximate probabilistic model to the samples. There are classic sampling methods, such as inverse sampling, rejection sampling, and Gibbs sampling, as well as more advanced methods, such as Markov chain Monte Carlo (MCMC) [6]. Our framework is similar to sampling methods in the sense that it generates samples for training and testing a DNN model.

Variational inference is an alternative to sampling. It approximates the original distribution with a fitted distribution q, turning inference into an optimization problem. Accordingly, the variational autoencoder [15] approximates the conditional distribution of latent variables given an input by reducing the KL-Divergence between the two distributions. This results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes (SGVB) estimator. Researchers have incorporated more sophisticated posteriors to extend the variational autoencoder [23, 21, 14].

II-C Bayesian Neural Networks

Since deep learning methods typically operate as black boxes, the uncertainty associated with their predictions is challenging to quantify. Bayesian statistics offer a formalism to understand and quantify this uncertainty. Differing from the point estimation used in a typical training process such as Stochastic Gradient Descent, Bayesian neural networks learn a probability distribution over the weights [12, 9, 18, 25].

II-D Calibration Methods

Guo et al. [8] translate the problem of calibrating the predictions made by a neural network into counting how many predicted samples are correct. Their calibration depends on a specific dataset, whereas we adopt a different mechanism in which our generative model can generate training datasets according to different hyperparameters.

III Framework

Fig. 1: Two paths of our assessment framework. A dataset generator is constructed and used to generate synthetic samples. Along the first path of Bayesian inference, it is easy to infer the posterior p(y|x) from the prescribed prior p(y) and likelihood f(x|y). Along the second path of sampling and training, we first sample the class label y based on a prescribed discrete distribution of y. For 1-D, x is sampled according to a prescribed likelihood f(x|y). For d dimensions, we first sample a vector z according to a prescribed Gaussian pdf in the reduced dimension, which is 2-D for the case of this study. We then map this 2-D vector to a d-dimensional vector x = g(z) using our reconstructive mapping g and add the pair (x, y) to the training dataset. Repeating this N times, we generate a training dataset containing N data samples, which we feed into the DNN for training. When the DNN is fully trained, its predicted probability p̂(y|x) for any input x can be compared with the ground truth p(y|x).

To assess how well a DNN captures the posterior pdf embedded in a dataset, we must first know the “truth” of the dataset. Yet, given an arbitrary dataset for a typical classification task, it is challenging to estimate the ground truth of the conditional relationship. We end up with a chicken-and-egg situation: We need the “ground truth” to evaluate an estimate, but we can only approximate the ground truth with an estimate. It thus becomes impossible to characterize the classification uncertainty with confidence.

To address this problem, we introduce a new assessment framework to systematically characterize DNNs' classification uncertainty, as illustrated in Fig. 1. The key idea of our framework is to construct a data generative model by specifying all the information required for the estimation, including the prior distribution p(y) and the dependency of x on y, thus establishing the ground truth. We then proceed along two paths: 1) the first path is through Bayesian inference, in which we directly calculate p(y|x) through Bayes' theorem, i.e., p(y|x) = f(x|y) p(y) / f(x); and 2) the second path is through generating a dataset using the aforementioned generative model and then training a DNN-based classifier, whose output p̂(y|x) approximates the "true" p(y|x) and is evaluated for its probability approximation ability. The second path is similar to approximate inference by sampling. After the DNN is fully trained, we can compare how close the results of these two paths are, i.e., we can compare the prediction made by the DNN and the "ground truth" from our dataset generator.

In practice, it is non-trivial to directly estimate high-dimensional distributions for many real-world cases. To tackle these cases, we first apply a dimensionality reduction technique, if necessary, to map the high-dimensional input samples to a more manageable low-dimensional space, in which we construct a generative model. We then generate an extensive set of synthetic samples by densely sampling the reduced-dimensionality space and mapping back to the original high-dimensional space (a reconstructive mapping). We can thus sample this extensive set of synthetic samples according to the prescribed prior and likelihood to serve as ground truth.

III-A Framework Formalization

A generative model produces an extensive dataset of synthetic samples, which we sample according to some prescribed prior p(y) and likelihood to serve as "ground truth". For 1-D, the likelihood is represented by a Gaussian pdf f(x|y). For d dimensions, we assume x is embedded in a lower-dimensional manifold, as is the case for many real-world datasets. Thus, the likelihood can be represented by a composite of lower-dimensional Gaussian pdf's f(z|y) and a reconstructive mapping function g.

For the real-world MNIST dataset studied (see Section IV-B), we find that x stays essentially on a 2-D manifold, so we have z ∈ R² and a reconstructive mapping function g: R² → R^d. Here, any bijective function that maps a 2-D vector back to a d-dimensional vector will work as a reconstructive mapping and can be seen as a decoding mapping. We detail our investigations into both the 1-D and high-dimensional cases in Section IV.

III-B Two Paths

We detail the two paths of inference used in our framework in the following subsections.

III-B1 Bayesian Inference

Since the synthetic samples produced by the generative model are constrained by the prescribed prior p(y) and likelihood f(x|y), we can easily infer p(y|x) based on Bayes' rule for the 1-D case:

p(y|x) = f(x|y) p(y) / Σ_y' f(x|y') p(y')    (1)

For d dimensions (d > 1), we may use p(y) and the low-dimensional likelihood f(z|y) to infer p(y|x):

p(y|x) = p(y|z) = f(z|y) p(y) / Σ_y' f(z|y') p(y'),  where x = g(z)    (2)

Appendix A gives the detailed mathematical proof.
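For the 1-D case, Eq. 1 can be evaluated directly once the prior and the Gaussian likelihoods are prescribed. A minimal Python sketch (the parameter values are illustrative, not those used in our experiments):

    import numpy as np
    from scipy.stats import norm

    # Prescribed generative model: two clusters with Gaussian likelihoods.
    priors = np.array([0.3, 0.7])                  # p(y); illustrative values
    means, stds = np.array([-5.0, 5.0]), np.array([1.0, 2.0])

    def true_posterior(x):
        """Eq. 1: p(y|x) = f(x|y) p(y) / sum over y' of f(x|y') p(y')."""
        lik = norm.pdf(x, loc=means, scale=stds)   # f(x|y) for each cluster
        joint = lik * priors
        return joint / joint.sum()

    print(true_posterior(0.0))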

III-B2 Data Sampling and Training

We sample, in both the 1-D and d > 1 cases, class labels based on a prescribed prior p(y). A 1-D training dataset is generated according to a prescribed likelihood f(x|y_i), where y_i is a sample of y, with the likelihood assumed to be Gaussian, i.e., f(x|y_i) = N(x; μ_i, σ_i²). For d dimensions, we first sample, according to a 2-D Gaussian pdf f(z|y_i), a 2-D vector z_i, which we map to a d-dimensional vector x_i = g(z_i). Subsequently, we add the pair (x_i, y_i) to the training dataset. Repeating this N times, we generate a training dataset containing N data samples, which are then fed into a DNN for training. When the DNN is fully trained, we compare its predicted probability p̂(y|x) for any given input x with the ground truth p(y|x).
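Continuing the sketch above, the sampling-and-training path can be emulated with an off-the-shelf fully connected network; scikit-learn's MLPClassifier stands in for the DNN here, and the sample size and architecture below are placeholders rather than our experimental settings:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    N = 50_000                                 # placeholder sample size

    # Sample labels y ~ p(y), then x ~ f(x|y) (priors/means/stds as above).
    y = rng.choice(2, size=N, p=priors)
    x = rng.normal(means[y], stds[y]).reshape(-1, 1)

    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200)
    clf.fit(x, y)

    # Predicted posterior p_hat(y|x) on a dense grid, to be compared
    # against true_posterior from the first path.
    grid = np.linspace(-35, 35, 701).reshape(-1, 1)
    p_hat = clf.predict_proba(grid)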

III-C Comparison

For each configuration of the generative model, we have an inferred p(y|x) and a trained (predicted) p̂(y|x). We can compare these two conditional probabilities at each sampled point x. More importantly, we can systematically explore all possible configurations of the generative model and find the main factors affecting the approximation precision of p̂(y|x). Given the complexity of this exploration, we report comprehensively only on the 1-D cases here. For d-dimensional cases, we use the MNIST dataset, fit its 2-D representation as a Gaussian mixture, and explore this specific configuration.

IV Experiment and Evaluation

Fig. 2: Prediction precision as a function of various factors. (a) plots the prediction precision in KL-Divergence as a function of the mixing coefficient w1. (c) plots the prediction precision in KL-Divergence as a function of the mean value of cluster 1, μ1. (e) plots the prediction precision in KL-Divergence as a function of the variances σ1² and σ2². (b), (d), and (f) plot the counterparts of (a), (c), and (e) in Absolute Difference, respectively.

We use Google Colab [4] with Nvidia T4 and P100 GPUs to run our experiments. We discuss below the experiment setups and results for the 1-D and high-dimensional cases, respectively.

IV-A One-Dimensional Case

To simplify our experiment without loss of generality, we use a mixture of two Gaussian pdf's as our data generator, and we call each Gaussian pdf a cluster. The 1-D generative model can be parameterized as f(x) = w1 N(x; μ1, σ1²) + w2 N(x; μ2, σ2²), where w1 + w2 = 1, and w1, μ1, μ2, σ1, and σ2 are the generative model parameters.

IV-A1 Systematic Analysis

Fig. 3: Prediction precision as a function of the marginal density and sparsity. Each point represents the situation for one sampled point x, among all the sampled points and all the parametric configurations of our generative model. (a) plots the prediction precision in KL-Divergence as a function of density. (b) plots the prediction precision in Absolute Difference as a function of density. (c) plots the prediction precision in KL-Divergence as a function of sparsity. (d) plots the prediction precision in Absolute Difference as a function of sparsity. For each figure, we also divide the x-axis into several bins, calculate the average value for each bin, and overlay the averages as a function of the sampled bins.
Grid Search in Parametric Space

To explore how each parameter influences the DNN prediction, we first conduct a grid search in the parametric space of the generative model, where each grid point is a combination of different values of the mixing coefficients, means, and variances. Here, we sample the means μ1 and μ2 with a fixed step size, sample the mixing coefficients w1 and w2 with a fixed step size subject to the condition w1 + w2 = 1, and likewise sample both σ1 and σ2 with a fixed step size. Our parametric space is therefore a finite grid, and for each grid point (i.e., a configuration of the generative model) we generate a training set of data samples and labels in a stochastic (random) manner and train a fully connected DNN on it. After the DNN is fully trained, we evaluate its prediction precision at points x sampled over [-35, 35] with a fixed step size. We choose 35 as it is slightly larger than the largest combination of sampled mean and standard deviation, so the interval covers the majority of the area with non-trivial density f(x). Then, we calculate the mean prediction precision over these sampled points. When we plot the prediction precision as a function of one of these three factors, we marginalize over the other two.

To measure the prediction precision, we use two metrics: KL-Divergence,

D_KL(p ‖ p̂) = Σ_y p(y|x) log [ p(y|x) / p̂(y|x) ]    (3)

and Absolute Difference,

AD = Σ_y | p(y|x) − p̂(y|x) |    (4)
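A straightforward Python rendering of the two metrics at a single sampled point x (the Absolute Difference form follows our reconstruction of Eq. 4 as a per-class sum):

    import numpy as np

    def kl_divergence(p, p_hat, eps=1e-12):
        """Eq. 3: KL-Divergence between true and predicted posteriors at x."""
        p, p_hat = np.asarray(p), np.asarray(p_hat)
        return float(np.sum(p * np.log((p + eps) / (p_hat + eps))))

    def abs_difference(p, p_hat):
        """Eq. 4 (as reconstructed): sum of per-class absolute differences."""
        return float(np.sum(np.abs(np.asarray(p) - np.asarray(p_hat))))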

Fig. 2 shows the prediction precision as a function of the mixing coefficient w1, the mean value of cluster 1, μ1, and the variances of the two clusters, σ1² and σ2². We can conclude from Fig. 2 that there is only a marginal relationship between the mixing coefficient and the prediction precision. Increasing the distance between the cluster means increases the prediction error, which we discuss below. The impact of the variances is more complex. Generally, we see two trends: prediction accuracy decreases both with smaller variance values and with decreasing difference between the two variances. The only exception to the second trend is that, when the two variances are exactly equal, the prediction is more accurate than the trend would suggest.

Fig. 4: Configuration with a large prediction error and its results. (a) and (b) plot the probability densities of clusters 1 and 2, respectively. (c) and (d) plot the predicted p̂(y|x) and true p(y|x) for clusters 1 and 2, respectively. We can see a large discrepancy where the density is low and the sparsity is high. (e) plots the density, sparsity, and generated training dataset for both clusters. (f) plots the prediction precision in KL-Divergence and Absolute Difference. (e) shows the density drops and the sparsity rises drastically in this region, coinciding with the sudden increase of the prediction error in (f).

Fig. 5: Configuration with a small prediction error and its results. (a) and (b) plot the probability densities of clusters 1 and 2, respectively. (c) and (d) plot the predicted p̂(y|x) and true p(y|x) for clusters 1 and 2, respectively. (e) plots the density, sparsity, and generated training dataset for both clusters. (f) plots the prediction precision in KL-Divergence and Absolute Difference. (e) shows regions where the sparsity is high and the density is low, yet the prediction error in (f) stays low for all sampled x.
Considering Density and Sparsity as Influencing Factors

Second, for each grid point (i.e., a parametric configuration), we record the prediction precision together with the density and sparsity at each point x sampled from [-35, 35] with a fixed step size, collecting a large number of points in total. As we define the prediction precision at each sampled x using KL-Divergence and Absolute Difference, we can plot scattered points illustrating the relationship between the prediction precision and two additional potential influencing factors: density and sparsity, defined respectively as the marginal density f(x) and the largest fractional contribution of a single cluster to f(x). Sparsity measures how much each cluster relatively contributes to the whole density. Fig. 3 illustrates the prediction precision as a function of density and sparsity. We can see that the prediction precision roughly obeys a power law with a large exponent for both density and sparsity: it decreases drastically as the density increases and increases drastically as the sparsity increases. This means that most prediction failures are observed when the density is low AND the sparsity is high. This condition is usually satisfied at the far outer side of both clusters or in between the two clusters when their variances are low. With this observation, revisiting Fig. 2 (c) and (d), we can see that when μ1 increases, the mean of cluster 2 decreases; thus, as the distance between the two clusters increases, it becomes easier to satisfy the failure condition at some sampled x. Similarly, in Fig. 2 (e) and (f), when the variances of both clusters are small, the failure condition tends to occur more frequently.
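Both factors can be computed in closed form from the generative model. In the sketch below, sparsity is implemented as the largest fractional contribution of a single cluster to f(x), which is our reading of the definition above:

    import numpy as np
    from scipy.stats import norm

    def density_and_sparsity(x, priors, means, stds):
        """Return the marginal density f(x) and the sparsity at x."""
        contrib = priors * norm.pdf(x, loc=means, scale=stds)  # f(x|y) p(y)
        f_x = contrib.sum()
        return f_x, contrib.max() / f_x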

IV-A2 Example Cases with Specific Configurations

To further illustrate our observation in Section IV-A1, we select two specific configurations with large and small prediction errors from the results in Fig. 3. For each configuration, we follow our assessment framework: generate a training dataset, train a DNN-based classifier, and compare the prediction made by the trained classifier with the truth induced by our data generator.

Configuration with Large Prediction Error

In this configuration, we prescribe distinct means, variances, and mixing coefficients for clusters 1 and 2 (the specific values are shown in Fig. 4). Fig. 4 shows the parametric configuration and experiment results for this setting. We can see that large prediction errors occur where the density is low and the sparsity is high, in accordance with the results of our grid search, which showed that most prediction failures are observed under exactly this condition.

Fig. 6: 2-D data and digit planes. (a) shows the t-SNE dimension-reduction result of the MNIST training dataset on a 2-D plane, where an ellipse surrounds each category and shows the area covered by the bulk of the data samples for each digit. (b) shows the synthetic training data for each category on the 2-D plane. (c) shows the corresponding digit for each grid point on the 2-D plane.
Fig. 7: f(z|y), true p(y|z), and predicted p̂(y|z) shown on the 2-D plane. The first row shows the distribution density f(z|y) for each digit category. The second and third rows illustrate the true posterior probability p(y|z) and the predicted p̂(y|z), respectively. The last row shows the pixel-wise absolute difference between the second and third rows.
Configuration with Small Prediction Error

In this configuration, the two clusters have distinct means, a common variance, and a mixing coefficient of 0.5 each (the specific values are shown in Fig. 5). Fig. 5 shows the parametric configuration and experiment results for this setting. We can see that even though there are regions of x that satisfy low density and high sparsity, the prediction error stays low. This observation does not, however, contradict the results of our grid search, because low density and high sparsity are necessary but not sufficient conditions for high prediction errors.

Fig. 8: Density, sparsity, and prediction precision on the 2-D plane. (a) shows the probability density f(z). (b) shows the sparsity for each 2-D data point z. (c) and (d) show the prediction error in KL-Divergence and Absolute Difference on the 2-D plane, respectively.
Fig. 9: Prediction precision as a function of density and sparsity. (a) and (b) show the density's impact on the prediction precision in terms of KL-Divergence and Absolute Difference. (c) and (d) show the sparsity's influence on the prediction precision in terms of KL-Divergence and Absolute Difference. For each plot, we also divide the x-axis into several bins, calculate the average value for each bin, and overlay the averages as a function of the sampled bins.

IV-B Pseudo High-Dimensional Case

To investigate whether these conclusions continue to hold in high-dimensional cases, we start as in the 1-D case but use a mixture of ten 2-D Gaussians as the data generator. After random samples on the 2-D plane are acquired, we use a reconstructive mapping function g to map each 2-D random sample z to a d-dimensional sample x = g(z). The high-dimensional generative model can be parameterized similarly to the 1-D case in Section IV-A: f(z) = Σ_y w_y N(z; μ_y, Σ_y), where the weights w_y, means μ_y, and covariances Σ_y are the model parameters.

We obtain g by reversing the t-SNE [26] dimension reduction of the MNIST dataset [16]. We first apply t-SNE to the MNIST dataset to obtain a set of data samples on a 2-D manifold and then train a DNN to map these 2-D samples back to the original MNIST images; this DNN acts as our g. We illustrate the training process of g in more detail in Appendix A. Since we consider the MNIST dataset embedded in a 2-D manifold and the DNN is continuous, we assume our g is smooth and bijective. Here, we could also use other generative functions, such as the decoder part of an autoencoder or a GAN, as g. Still, as shown in Fig. 6 (a), a single Gaussian pdf for each digit category on the 2-D plane acquired by the t-SNE reduction is adequate. The first row of Fig. 7 illustrates the conditional density f(z|y) for each categorical variable y. In our experiment, we set each prior p(y) to the same value, so the marginal density function f(z) in Fig. 8 (a) is, up to scale, a simple addition of the densities in the first row of Fig. 7. Once we calculate all the parameters for each Gaussian pdf, we can generate our training dataset as shown in Fig. 6 (b) and use g to map the 2-D samples back to the original image space. Fig. 6 (c) shows how each 2-D point maps to the original image space.

After generating the synthetic training dataset of N samples, we train a convolutional neural network to classify the digits. The second and third rows of Fig. 7 show the true p(y|z) and the predicted p̂(y|z) on the 2-D plane, respectively. From the last row of Fig. 7, we can see from the shape of the light-colored area for each digit that the prediction is generally accurate.

Similarly to Section IV-A, we want to find the factors influencing the prediction precision; here, we focus on density and sparsity. As it is difficult to estimate the actual f(x) and f(x|y) in the high-dimensional space, we use f(z) and f(z|y) instead. Thus, the density is f(z), and we adopt a normalized sparsity measure [10] of the per-class contributions c_y = f(z|y) p(y) / f(z):

S(z) = (√n − (Σ_y c_y) / √(Σ_y c_y²)) / (√n − 1)    (5)

where n is the number of categories.
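[10] surveys several candidate measures; as one plausible instantiation (an assumption on our part, matching the normalized form above), the Hoyer measure of the contribution vector can be computed as:

    import numpy as np

    def hoyer_sparsity(contrib):
        """Hoyer sparsity of the per-class contributions c_y.

        Returns 0 when all classes contribute equally and 1 when a single
        class accounts for all of f(z); one of the measures surveyed in [10].
        """
        c = np.asarray(contrib, dtype=float)
        n = c.size
        return float((np.sqrt(n) - c.sum() / np.sqrt((c ** 2).sum())) / (np.sqrt(n) - 1))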

We still use KL-Divergence and Absolute Difference as the metrics of prediction precision. Fig. 8 shows the density, sparsity, and prediction precision on the 2-D plane, and their correlation. Generally speaking, the areas of high prediction error (i.e., the light-colored areas in (c) and (d)) correspond to the areas of low density and high sparsity. Fig. 9 further shows the prediction precision as a function of density and sparsity. Again, we reach a conclusion similar to the 1-D case: the prediction precision follows a power law as a function of density and sparsity. We choose three 1-D paths on the 2-D plane to illustrate this conclusion in detail in Appendix A.

V Conclusion

We design an innovative framework for characterizing the uncertainty associated with a DNN's approximation of the conditional distribution underlying its training dataset. We develop a two-path evaluation paradigm in which we use Bayesian inference to obtain the theoretical ground truth from a generative model and use sampling and training to acquire a DNN-based classifier. We compare the prediction made by the fully trained DNN with the theoretical ground truth and thereby evaluate its capability as a probability distribution estimator. We conduct extensive experiments for 1-D and high-dimensional cases and, for both, come to similar conclusions: most prediction failures made by a DNN are observed where the (local) data density is low and the inter-categorical sparsity is high, and the prior probability has less impact on DNNs' classification uncertainty. This insight may help delineate the capability of DNNs as probability estimators and aid the interpretation of the inferences produced by various deep models. Interesting areas for future research include the application of our framework to more complex categorization scenarios requiring more sophisticated generative models and remappings.

VI Acknowledgment

This research has been sponsored in part by the National Science Foundation grant IIS-1652846.

Appendix A

Proof for Section III-B1: As we need to map a 2-D vector z to a d-dimensional vector x using a reconstructive mapping, we assume g is a composite mapping function g = g2 ∘ g1, composed of a continuous bijective function g1: R² → R² and a locally isometric function g2: R² → R^d. First, we can infer p(y|z) as

p(y|z) = f(z|y) p(y) / Σ_y' f(z|y') p(y')    (6)

Then, for the continuous bijection g1, similarly to Eq. 6 we have

p(y|u) = f(u|y) p(y) / Σ_y' f(u|y') p(y')    (7)

where u = g1(z). As g1 is continuous and bijective, we have f(u|y) = f(z|y) / |J| and f(u) = f(z) / |J|, where |J| is the Jacobian determinant of g1. Since the factor 1/|J| cancels, according to Eq. 6 and 7 we have:

p(y|u) = p(y|z)    (8)

Again, for the locally isometric mapping g2 and the d-dimensional vector x = g2(u), we have:

p(y|x) = f(x|y) p(y) / Σ_y' f(x|y') p(y')    (9)

As g2 is locally isometric, all probability density functions on u and x change proportionally. Hence, we have f(x|y) ∝ f(u|y) and f(x) ∝ f(u) with a common proportionality factor set by the local volume element, and this factor cancels in Eq. 9. Again, according to Eq. 7 and 9, we have

p(y|x) = p(y|u)    (10)

Thus, we have p(y|x) = p(y|z) based on Eq. 8, 10, and 6, meaning we can make use of p(y) and f(z|y) to infer p(y|x).

Fig. 10: Three example paths.

1-D Paths for the High-Dimensional Case: To illustrate in more detail how density and sparsity correlate with prediction precision, we can select any path on the 2-D plane and show the digit-wise prediction, density, sparsity, and prediction precision along it. Fig. 10 shows three example paths on the 2-D digit plane, and Fig. 11 shows the statistics for Path C. From the second-to-last row in Fig. 11, we can see that the density usually correlates with the sparsity, whereas, ideally, density and sparsity would be independent of each other. We observe this phenomenon along the other paths as well and attribute it to the specific configuration of our 2-D Gaussian mixture (Section III-C explains why we only consider a specific configuration for the high-dimensional case). For 1-D, the systematic grid search in the parametric space removes this unwanted correlation between density and sparsity (see Fig. 4 (e), in which there is no correlation between density and sparsity).

Fig. 11: Path C traverses several digit clusters on the digit plane, with sampled images shown along the way. The prediction p̂(y|z) and the theoretical truth p(y|z) are plotted digit-wise. As the path leaves a cluster, the probability of its digit gradually decreases from 1 to 0, while the probability of the next digit shows the opposite trend; between clusters, the probabilities of visually similar digits rise and fall, indicating that the intermediate images somewhat resemble those digits. The second-to-last row plots the density and the sparsity along this path, and the last row plots the prediction precision in KL-Divergence and Absolute Difference. Three error peaks coincide with three valleys of the density and peaks of the sparsity.


Fig. 12: (a) Mean square error for the training and validation of g over 1000 iterations. (b) Comparison of the original MNIST digits and the digits reconstructed by g, with originals and reconstructions shown in alternating rows, starting with the originals.

Training of g: We first apply t-SNE to the MNIST dataset to get a set of 2-D data samples and then train a fully connected DNN to map these 2-D samples back to the original MNIST images. We show the training of g in Fig. 12 (a) and the efficacy of the reconstruction mapping in Fig. 12 (b).
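A condensed sketch of this procedure using scikit-learn (the subsampling, layer sizes, and iteration counts are placeholders chosen for speed, not our actual settings):

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.manifold import TSNE
    from sklearn.neural_network import MLPRegressor

    # 2-D t-SNE embedding of (a subsample of) MNIST.
    X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X[:5000].astype(float) / 255.0
    Z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)

    # g: fully connected DNN mapping 2-D coordinates back to 784-pixel images,
    # trained by minimizing the mean square error (cf. Fig. 12 (a)).
    g = MLPRegressor(hidden_layer_sizes=(256, 512), max_iter=300)
    g.fit(Z, X)
    recon = g.predict(Z[:8])   # compare against X[:8], cf. Fig. 12 (b)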

References

  • [1] A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
  • [3] C. M. Bishop (1994) Mixture density networks. Technical Report, Aston University, Birmingham.
  • [4] E. Bisong (2019) Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform, pp. 59–64.
  • [5] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
  • [6] C. J. Geyer (1992) Practical Markov chain Monte Carlo. Statistical Science, pp. 473–483.
  • [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661.
  • [8] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
  • [9] J. M. Hernández-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869.
  • [10] N. Hurley and S. Rickard (2009) Comparing measures of sparsity. IEEE Transactions on Information Theory 55 (10), pp. 4723–4741.
  • [11] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
  • [12] L. V. Jospin, W. Buntine, F. Boussaid, H. Laga, and M. Bennamoun (2020) Hands-on Bayesian neural networks: a tutorial for deep learning users. arXiv preprint arXiv:2007.06823.
  • [13] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
  • [14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934.
  • [15] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [16] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
  • [17] Y. Lu and J. Lu (2020) A universal approximation theorem of deep neural networks for expressing distributions. arXiv preprint arXiv:2004.08867.
  • [18] D. J. MacKay (1995) Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A 354 (1), pp. 73–80.
  • [19] M. Magdon-Ismail and A. Atiya (1999) Neural networks for density estimation.
  • [20] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • [21] E. Nalisnick, L. Hertel, and P. Smyth (2016) Approximate inference for deep latent Gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, Vol. 2, pp. 131.
  • [22] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [23] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
  • [24] D. F. Specht (1990) Probabilistic neural networks. Neural Networks 3 (1), pp. 109–118.
  • [25] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter (2016) Bayesian optimization with robust Bayesian neural networks. Advances in Neural Information Processing Systems 29, pp. 4134–4142.
  • [26] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11).
  • [27] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.
  • [28] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915.
  • [29] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.