I Introduction
The potential of deep neural networks (DNNs) has been amply demonstrated with classification. If we denote the input sample as a vector x and the category as a vector y, a classification task can be viewed as taking the following two steps: 1) it first predicts the conditional probability p(y|x), and then 2) makes a decision on the category an input sample belongs to based on some specific criterion, such as argmax_i p(y=i|x), i.e., identifying the category with the largest element of p(y|x). Although the final classification result is of primary interest, the intermediate result p(y|x) is necessary for scientific applications, in which the characterization of classification uncertainties is desired. However, there is a lack of systematic investigations into this characterization.
A DNN often uses softmax on the output of its last layer. Since softmax yields a vector whose elements all fall in the interval (0, 1) and sum to 1, it suggests a probabilistic interpretation of the DNN’s outcome and is used to approximate the probability p(y|x) of the categorical variable y. Typically during training, the labels of all input samples fed into the DNN are in the one-hot format (i.e., each sample is associated with a single category). The DNN learns p(y|x) implicitly by minimizing the cross-entropy between the output and the one-hot label, without p(y|x) ever being revealed explicitly. The main mechanism for the DNN to capture p(y|x) is through relating local samples of x to the frequencies of y. The (local) sparsity of x in the training dataset, therefore, may limit the capability of the DNN to capture p(y|x).

We are interested in assessing the quality of the prediction of p(y|x) and exploring potential factors that may impact the performance metric. However, the lack of ground truth makes it difficult to assess the prediction of p(y|x) generated by DNNs. In this paper, we address these challenges with the following main contributions:

We propose an innovative generative framework with two paths: one for directly inferring p(y|x) assuming Gaussian probability density functions (pdf’s), and one for generating data and training a DNN to approximate p(y|x).
We conduct extensive and systematic experiments for both 1D and high-dimensional inputs; the resulting insights suggest that the likelihood probability density and the inter-categorical sparsity are more influential factors on the performance metric than the prior probability.
II Related Work
We describe works related to ours in this section. We note that the sample labeling process naturally biases the distribution because, under the one-hot convention, existing works tend to discard the samples whose categories annotators are unsure of. We cannot uncritically assume that the distributions of the labeled samples, whether for training or testing, accurately represent those of the population in the real world.
II-A Estimating Probability using DNNs
Substantial work has been conducted to estimate the underlying probability distribution in training data using DNNs. Based on how the actual probability (density) function is approximated, we may divide the existing work into two categories: implicit and explicit estimation.
When a model uses implicit estimation, it does not approximate the distribution in a closed form but usually generates samples subject to the distribution. A Generative Adversarial Network (GAN) [7] consists of a generator and a discriminator, which co-evolve to achieve the best performance. The generator implicitly models the distribution of the training data, and the discriminator attempts to differentiate between the true distribution and the synthesized distribution from the generator. The generator, however, has not been leveraged to generate samples with prescribed distributions for uncertainty characterization. Ever since its invention, GAN has evolved into a large family of architectures [20, 2, 5, 11, 13, 22, 28, 27, 29].

Explicit estimation attempts to learn the distribution in a closed form. Some pioneering studies [17, 1, 19] discuss the capability of DNNs to approximate probability distributions. The Mixture Density Network (MDN) [3] predicts not only the expected value of a target but also the underlying probability distribution. Given an input, MDN extends maximum likelihood by substituting the Gaussian pdf with a mixture model. The Probabilistic Neural Network (PNN) [24] uses a Parzen window to estimate the probability density for each category and then uses Bayes’ rule to calculate the posterior p(y|x). PNN is nonparametric in the sense that it does not need any learning process, and at each inference it uses all training samples as its weights. These techniques do not seem to consider the possibility that the distributions of the labeled samples may deviate from those of the population.

II-B Approximate Inference
As the inference process for complex models is usually computationally intractable, one cannot perform inference exactly and must resort to approximation. Approximate inference methods likewise fall into two categories: sampling and variational inference.
Sampling is a common method to address intractable models. One can draw samples from a model and fit an approximate probabilistic model from the samples. There are classic sampling methods, such as inverse sampling, rejection sampling, and Gibbs sampling, as well as more advanced methods, such as Markov chain Monte Carlo (MCMC) [6]. Our framework is similar to the sampling method in the sense that it generates samples for training and testing a DNN model.

Variational inference is an alternative to sampling. It approximates the original distribution with a fitted distribution, turning inference into an optimization problem. Accordingly, the variational autoencoder [15] approximates the conditional distribution of latent variables given an input by reducing the KL-divergence between the two distributions. This results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes (SGVB) estimator. Researchers have incorporated more sophisticated posteriors to extend the variational autoencoder [23, 21, 14].

II-C Bayesian Neural Networks
Since deep learning methods typically operate as black boxes, the uncertainty associated with their predictions is challenging to quantify. Bayesian statistics offer a formalism to understand and quantify this uncertainty. Differing from the point estimation used in a typical training process such as stochastic gradient descent, Bayesian neural networks learn a probability distribution over the weights [12, 9, 18, 25].

II-D Calibration Methods

Guo et al. [8] translate the problem of calibrating the predictions made by a neural network into counting how many predictions are correct. Their calibration depends on a specific dataset, whereas we adopt a different calibration mechanism in which our generative model can generate training datasets according to different hyperparameters.
III Framework
To assess how well a DNN captures the posterior pdf embedded in a dataset, we must first know the “truth” of the dataset. Yet, given an arbitrary dataset for a typical classification task, it is challenging to estimate the ground truth of the conditional relationship. We end up in a chicken-and-egg situation: we need the “ground truth” to evaluate an estimate, but we can only approximate the ground truth with an estimate. It thus becomes impossible to characterize the classification uncertainty with confidence.
To address this problem, we introduce a new assessment framework to systematically characterize DNNs’ classification uncertainty, as illustrated in Fig. 1. The key idea of our framework is to construct a data generative model by specifying all the information required for the estimation, including the prior distribution p(y) and the dependency of x on y, thus establishing the ground truth. We then proceed along two paths: 1) the first path is through Bayesian inference, in which we directly calculate p(y|x) through Bayes’ theorem, i.e., p(y|x) = p(x|y)p(y)/p(x); and 2) the second path is through generating a dataset using the aforementioned generative model and then training a DNN-based classifier whose output q(y|x) serves as an approximation to the “true” p(y|x) and is evaluated for its probability approximation ability. The second path is similar to approximate inference by sampling. After the DNN is fully trained, we can compare how close the results of these two paths are; in other words, we can compare the prediction made by the DNN with the “ground truth” from our dataset generator.

In practice, it is nontrivial to directly estimate high-dimensional distributions for many real-world cases. To tackle these cases, we first apply a dimensionality reduction technique, if necessary, to map the high-dimensional input samples to a more manageable low-dimensional space, from which we construct a generative model to generate an extensive set of synthetic samples by densely sampling the reduced-dimensionality space and inversely mapping back to the original high-dimensional space (aka reconstructive mapping). We can then sample this extensive dataset of synthetic samples according to a prescribed prior and likelihood to serve as ground truth.
III-A Framework Formalization
A generative model produces an extensive dataset of synthetic samples, which we sample according to some prescribed prior and likelihood to serve as “ground truth”. For 1D, the likelihood p(x|y=i) is represented by a Gaussian pdf N(x; μ_i, σ_i²). For dimension d > 1, we assume x is embedded in a lower-dimensional manifold, as is the case for many real-world datasets. Thus, the likelihood can be represented by a composite of lower-dimensional Gaussian pdf’s and a reconstructive mapping function g.
For the real-world MNIST dataset studied (see Section IV-B), we find that x stays essentially on a 2D manifold, so we have a 2D latent vector z and a reconstructive mapping function g that maps z back to the d-dimensional input space. Here, any bijective function that maps a 2D vector back to a d-dimensional vector will work as a reconstructive mapping and can be seen as a decoding mapping. We detail our investigations into both the 1D and high-dimensional cases in Section IV.

III-B Two Paths
We detail the two paths of inference used in our framework in the following subsections.
III-B1 Bayesian Inference
Since the synthetic samples produced by the generative model are constrained by the prescribed prior p(y) and likelihood p(x|y), we can easily infer p(y|x) based on Bayes’ rule for the 1D case:
p(y=i|x) = π_i N(x; μ_i, σ_i²) / Σ_j π_j N(x; μ_j, σ_j²)   (1)
For dimension d > 1, we may use p(y|z) on the low-dimensional manifold and the reconstructive mapping g to infer p(y|x):
p(y=i|x) = p(y=i|z) = π_i N(z; μ_i, C_i) / Σ_j π_j N(z; μ_j, C_j)   (2)
Appendix A gives the detailed mathematical proof.
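As a concrete illustration of the 1D inference path, Eq. 1 can be evaluated directly; the following sketch (with illustrative parameter values, not the paper's actual configuration) computes the posterior for a two-cluster mixture:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Evaluate the 1D Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x, pis, mus, sigmas):
    """Bayes' rule (Eq. 1): p(y=i|x) for each mixture component i."""
    weighted = np.array([pi * gaussian_pdf(x, mu, s)
                         for pi, mu, s in zip(pis, mus, sigmas)])
    return weighted / weighted.sum(axis=0)

# Illustrative two-cluster configuration.
pis, mus, sigmas = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
p = posterior(0.0, pis, mus, sigmas)  # posterior at the symmetric midpoint
```

With equal priors and symmetric clusters, the posterior at the midpoint is exactly (0.5, 0.5), while near a cluster center one component dominates.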
III-B2 Data Sampling and Training
We sample, in both the 1D and d > 1 cases, class labels based on a prescribed prior p(y). A 1D training dataset is generated according to a prescribed likelihood p(x|y=i), where x is a sample of the input, with the likelihood assumed to be Gaussian, i.e., p(x|y=i) = N(x; μ_i, σ_i²). For dimension d > 1, we first sample, according to a 2D Gaussian pdf N(z; μ_i, C_i), a 2D vector z, which we map to a d-dimensional vector x = g(z). Subsequently, we add the pair (x, y) to the training dataset. Repeating this N times, we generate a training dataset containing N data samples, which are then fed into a DNN for training. When the DNN is fully trained, we compare the predicted probability q(y|x) of any given input x with the ground truth p(y|x).
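The sampling procedure above can be sketched for the 1D case as follows (the configuration values are illustrative assumptions, not the paper's; for d > 1 one would draw 2D Gaussian samples and map them through g instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_1d_dataset(n, pis, mus, sigmas):
    """Sample labels y ~ p(y) from the prior, then inputs x ~ N(mu_y, sigma_y^2)."""
    ys = rng.choice(len(pis), size=n, p=pis)                  # prior p(y)
    xs = rng.normal(np.take(mus, ys), np.take(sigmas, ys))    # likelihood p(x|y)
    return xs, ys

# Illustrative configuration: 30%/70% prior, clusters at -2 and +2.
xs, ys = sample_1d_dataset(10000, [0.3, 0.7], [-2.0, 2.0], [1.0, 1.0])
```

The resulting (xs, ys) pairs play the role of the synthetic training dataset fed to the DNN.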
III-C Comparison
For each configuration of the generative model, we have an inferred p(y|x) and a trained (predicted) q(y|x). We can compare these two conditional probabilities at each sampled point of x. More importantly, we can systematically explore all possible configurations of the generative model and find the main factors affecting the approximation precision of q(y|x). Given the complexity of this exploration, we report here more comprehensively on only the 1D cases. For high-dimensional cases, we use the MNIST dataset, fit its 2D representation as a Gaussian mixture, and explore this specific configuration.
IV Experiment and Evaluation
We use Google Colab [4] with Nvidia T4 and P100 GPUs to run our experiments. We discuss below the experiment setups and results for the 1D and high-dimensional cases, respectively.
IV-A One-Dimensional Case
To simplify our experiment without loss of generality, we use a mixture of two Gaussian pdf’s as our data generator. Here, we call each Gaussian pdf a cluster. The 1D generative model can be parameterized as p(x) = π_1 N(x; μ_1, σ_1²) + π_2 N(x; μ_2, σ_2²), where π_1 + π_2 = 1, and π_1, μ_1, μ_2, σ_1², and σ_2² are the generative model parameters.
IV-A1 Systematic Analysis
Grid Search in Parametric Space
To explore how each parameter would influence the DNN prediction, we first conduct a grid search in the parametric space of the generative model, where each grid point is a combination of different values of the mixing coefficient, the cluster means, and the cluster variances. We sample the mixing coefficient and the cluster means each with a fixed step size, subject to the constraint π_1 + π_2 = 1, and we also sample both variances with a fixed step size. Our parametric space thus contains a finite set of grid points, and for each grid point we generate a training set and train a fully connected DNN. For each grid point (i.e., a configuration of the generative model), we generate data samples and labels in a stochastic (random) manner. After the DNN is fully trained, we evaluate the prediction precision of the trained DNN at values of x sampled with a fixed step size over a range bounded by 35. We choose 35 as it is slightly larger than the largest sampled cluster mean plus a few times the largest sampled standard deviation, and can thus cover the majority of the area of nontrivial density p(x). Then, we calculate the mean prediction precision at these sampled points. When we plot the prediction precision as a function of each of these three factors, we marginalize the other two factors.
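The grid-search loop can be sketched as below. Since we cannot reproduce the paper's DNN training here, a per-bin label-frequency estimate stands in for the trained classifier, mirroring the mechanism of relating local samples of x to the frequencies of y; the grid values, bin count, and sample size are illustrative assumptions only:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def mean_abs_error(pi1, mu2, s1, s2, n=20000, bins=60):
    """Stand-in for one grid point: sample a dataset, estimate p(y=1|x) from
    per-bin label frequencies (a crude proxy for a trained DNN), and compare
    with the true posterior from Bayes' rule (cluster 1 fixed at mean 0)."""
    ys = rng.choice(2, size=n, p=[pi1, 1 - pi1])
    mus, ss = np.array([0.0, mu2]), np.array([s1, s2])
    xs = rng.normal(mus[ys], ss[ys])
    edges = np.linspace(xs.min(), xs.max(), bins + 1)
    idx = np.clip(np.digitize(xs, edges) - 1, 0, bins - 1)
    counts, hits = np.zeros(bins), np.zeros(bins)
    np.add.at(counts, idx, 1)
    np.add.at(hits, idx, ys)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0
    q1 = hits[mask] / counts[mask]                       # frequency estimate
    p1 = ((1 - pi1) * norm_pdf(centers[mask], mu2, s2) /
          (pi1 * norm_pdf(centers[mask], 0.0, s1) +
           (1 - pi1) * norm_pdf(centers[mask], mu2, s2)))  # true posterior
    return np.abs(q1 - p1).mean()

# A small illustrative grid; the paper's actual step sizes and ranges differ.
grid = itertools.product([0.3, 0.5], [1.0, 3.0], [0.5, 1.0], [0.5, 1.0])
errors = {cfg: mean_abs_error(*cfg) for cfg in grid}
```

In the real experiment, `mean_abs_error` would be replaced by dataset generation plus full DNN training and evaluation at each grid point.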
To measure the prediction precision, we use two metrics, KL-Divergence:

D_KL(p ‖ q) = Σ_i p(y=i|x) log( p(y=i|x) / q(y=i|x) )   (3)

and Absolute Difference:

Σ_i | p(y=i|x) − q(y=i|x) |   (4)
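A minimal sketch of the two metrics, applied to a pair of categorical distributions (the particular probability values are illustrative):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Eq. 3: D_KL(p || q) between two categorical distributions.
    A small eps guards against log(0) for near-zero probabilities."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def abs_difference(p, q):
    """Eq. 4: element-wise absolute difference, summed over categories."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

p_true = [0.9, 0.1]   # inferred posterior at a sampled x (illustrative)
q_pred = [0.8, 0.2]   # predicted posterior at the same x (illustrative)
```

Both metrics are zero exactly when the prediction matches the inferred posterior.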
Fig. 2 shows the prediction precision as a function of the mixing coefficient, the mean value of cluster 1, and the variances of the two clusters. We can conclude from Fig. 2 that there is only a marginal relationship between the mixing coefficient and the prediction precision. Increasing the distance between the cluster mean values can improve the prediction precision, which we discuss below. The impact of the variances is relatively complex. Generally, we see two trends: prediction accuracy decreases both with smaller variance values and with decreasing distance between the two variances. The only exception to the second trend is that, when the two variances are equal, the prediction is more accurate than the second trend would suggest.
Considering Density and Sparsity as Influencing Factors
Second, for each grid point (i.e., a parametric configuration), we record the prediction precision together with the density and sparsity at values of x sampled with a fixed step size, collecting a large set of points in total. As we define the prediction precision at each sampled x using KL-Divergence and Absolute Difference, we can plot the scattered points illustrating the relationship between the prediction precision and two additional potential influencing factors: density and sparsity. The density is the marginal density p(x), and sparsity measures how much each cluster relatively contributes to the whole density. Fig. 3 illustrates the prediction precision as a function of density and sparsity. We can see that the prediction precision roughly obeys a power law with a large exponent for both density and sparsity: the prediction error decreases drastically as the density increases, while it increases drastically as the sparsity increases. This means that most prediction failures are observed when the density is low AND the sparsity is high. This condition is usually satisfied on the far outer side of both clusters or between the two clusters when their variances are low. With this observation, and revisiting Fig. 2 (c) and (d), we can see that as the distance between the two cluster means increases, it becomes easier to satisfy the failure condition at certain sampled x. Similarly, in Fig. 2 (e) and (f), when the variances of both clusters are small, the failure condition tends to occur more frequently.
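Density and a stand-in sparsity can be computed as below. The paper's exact 1D sparsity formula is not reproduced in this rendering, so this sketch assumes sparsity is the largest relative contribution of any single cluster to p(x), matching the qualitative description above; the mixture parameters are illustrative:

```python
import numpy as np

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def density_and_sparsity(x, pis, mus, sigmas):
    """Density is the marginal p(x). As an illustrative stand-in for the
    paper's sparsity, we use the largest relative contribution of any single
    cluster to p(x): 0.5 means balanced clusters, near 1 means one dominates."""
    contribs = np.array([pi * norm_pdf(x, mu, s)
                         for pi, mu, s in zip(pis, mus, sigmas)])
    density = contribs.sum()
    sparsity = contribs.max() / density
    return density, sparsity

# Far from both clusters: low density, one cluster dominating (high sparsity).
d_far, s_far = density_and_sparsity(6.0, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
# Midpoint between the clusters: higher density, balanced contributions.
d_mid, s_mid = density_and_sparsity(0.0, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```

Under this stand-in definition, the "failure condition" (low density, high sparsity) is met at the outer tails, consistent with the trend reported above.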
IV-A2 Example Cases with Specific Configurations
To further illustrate our observation in Section IV-A1, we select two specific configurations with large and small prediction errors from the results in Fig. 3. For each configuration, we follow our assessment framework: generate a training dataset, train a DNN-based classifier, and compare the prediction made by the trained classifier with the truth induced by our data generator.
Configuration with Large Prediction Error
In this configuration, cluster 1 and cluster 2 are assigned the means, variances, and mixing coefficients shown in Fig. 4, which also shows the experiment results for this setting. We can see that large prediction errors occur in regions of x that coincide with low density and high sparsity. This observation is in accordance with the results from our grid search, which showed that most prediction failures are observed under low density and high sparsity.
Configuration with Small Prediction Error
In this configuration, the two clusters take the mean values shown in Fig. 5; each cluster has the same variance and a mixing coefficient of 0.5. Fig. 5 shows the parametric configuration and experiment results for this setting. We can see that even though some regions of x satisfy low density and high sparsity, the prediction error stays low. This observation does not, however, contradict the experiment results from our grid search, because low density and high sparsity are necessary but not sufficient conditions for high prediction errors.
IV-B Pseudo-High-Dimensional Case
To investigate whether these conclusions continue to hold in high-dimensional cases, we start as in the 1D case but use a mixture of ten 2D Gaussians as the data generator. After the random samples on the 2D plane are acquired, we use a reconstructive mapping function g to map a 2D random sample z to a d-dimensional sample x = g(z). The high-dimensional generative model can be parameterized similarly to the 1D case in Section IV-A: p(z) = Σ_i π_i N(z; μ_i, C_i), where Σ_i π_i = 1, and π_i, μ_i, and C_i are the generative model parameters.
We get g by reversing the t-SNE [26] dimension reduction of the MNIST dataset [16]. We first apply t-SNE to the MNIST dataset to obtain a set of data samples on a 2D manifold and then train a DNN to map these 2D samples back to the original MNIST images. Thus, this DNN acts as our g. We illustrate the training process of g in more detail in Appendix A. Since we consider the MNIST dataset to be embedded in a 2D manifold and the DNN is continuous, we assume our g is smooth and bijective. Here, we could also use other generative functions, such as the decoder part of an autoencoder or a GAN, as g. Still, as shown in Fig. 6 (a), a single Gaussian pdf for each digit category on the 2D plane acquired by the t-SNE reduction is adequate. The first row of Fig. 7 illustrates the conditional distribution p(z|y=i) for each categorical value i. In our experiment, we set each prior p(y=i) to be equal. Thus, we can see that the marginal density function p(z) in Fig. 8 (a) is a simple addition of the first row of Fig. 7 (a). Once we calculate all the parameters for each Gaussian pdf, we can generate our training dataset as shown in Fig. 6 (b), and use g to map the 2D samples back to the original image space. Fig. 6 (c) shows how each 2D point maps to the original image space.
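We cannot reproduce the MNIST/t-SNE pipeline here, but the idea of training a DNN as the reconstructive mapping g can be sketched on toy data: a small fully connected network, trained by plain gradient descent, learns to map 2D codes to higher-dimensional targets. The data, architecture, and hyperparameters below are all illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the (2D t-SNE code, image) pairs: 2D codes Z and
# d-dimensional targets X produced by a known smooth map (not real MNIST).
d = 16
Z = rng.uniform(-1.0, 1.0, size=(2000, 2))
X = np.tanh(Z @ rng.normal(size=(2, d)))

# One-hidden-layer MLP standing in for g: R^2 -> R^d.
h = 64
W1 = rng.normal(scale=0.3, size=(2, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.3, size=(h, d)); b2 = np.zeros(d)
lr = 0.01

def forward(Z):
    H = np.tanh(Z @ W1 + b1)
    return H, H @ W2 + b2

loss_before = float(np.mean((forward(Z)[1] - X) ** 2))

# Full-batch gradient descent on the mean-squared reconstruction error.
for _ in range(1000):
    H, out = forward(Z)
    g_out = 2.0 * (out - X) / len(Z)          # d(MSE)/d(out)
    gW2, gb2 = H.T @ g_out, g_out.sum(axis=0)
    g_H = (g_out @ W2.T) * (1.0 - H ** 2)     # backprop through tanh
    gW1, gb1 = Z.T @ g_H, g_H.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

loss_after = float(np.mean((forward(Z)[1] - X) ** 2))
```

In the actual experiment, Z would be the t-SNE embedding of MNIST and X the corresponding images, with a deeper network trained by a standard optimizer.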
After generating the synthetic training dataset with N samples, we train a convolutional neural network to classify the digits. The second and third rows of Fig. 7 show the true p(y|z) and the predicted q(y|z) on the 2D plane, respectively. From the last row of Fig. 7, we can see that the prediction is generally accurate, judging from the shape of the light-colored area for each digit.

Similarly to Section IV-A, we want to find the influencing factors of prediction precision. Here, we focus on density and sparsity. As it is difficult to estimate the actual density and sparsity in the original high-dimensional space, we use their counterparts on the 2D plane instead. Thus, the density is the marginal density p(z), and we adopt the sparsity measure of [10]:
(5) 
We still use KL-Divergence and Absolute Difference as the metrics of prediction precision. Fig. 8 shows the density, sparsity, and prediction precision on the 2D plane, and their correlation. Generally speaking, the areas of high prediction error (i.e., the light-colored areas in (c) and (b)) correspond to the areas of low density and high sparsity. Fig. 9 further shows the prediction precision as a function of density and sparsity. Again, we reach a conclusion similar to the 1D case: the prediction precision follows a power law as a function of density and sparsity. We choose three 1D paths on the 2D plane to illustrate our conclusion in detail in Appendix A.
V Conclusion
We design an innovative framework for characterizing the uncertainty associated with a DNN’s approximation to the conditional distribution underlying its training dataset. We develop a two-path evaluation paradigm in which we use Bayesian inference to obtain the theoretical ground truth based on a generative model, and use sampling and training to acquire a DNN-based classifier. We compare the prediction made by the fully trained DNN with the theoretical ground truth and evaluate its capability as a probability distribution estimator. We conduct extensive experiments for 1D and high-dimensional cases. For both cases, we come to similar conclusions: most prediction failures made by a DNN are observed when the (local) data density is low and the inter-categorical sparsity is high, and the prior probability has less impact on DNNs’ classification uncertainty. This insight may help delineate the capability of DNNs as probability estimators and aid the interpretation of the inference produced by various deep models. Interesting areas for future research include the application of our framework to more complex categorization scenarios requiring more sophisticated generative models and remappings.
VI Acknowledgment
This research has been sponsored in part by the National Science Foundation grant IIS1652846.
Appendix A
Proof for Section III-B1: As we need to map a 2D vector z to a d-dimensional vector x using a reconstructive mapping, we assume g is a composite mapping function g = g2 ∘ g1, composed of a continuous bijective function g1 and a locally isometric function g2, with v = g1(z) and x = g2(v). First, we can infer p(y|z) as

p(y=i|z) = p(z|y=i) p(y=i) / Σ_j p(z|y=j) p(y=j)   (6)
Then, for the continuous bijection g1, similarly to Eq. 6 we have

p(y=i|v) = p(v|y=i) p(y=i) / Σ_j p(v|y=j) p(y=j)   (7)

where v = g1(z). As g1 is continuous and bijective, we have p(v) = p(z)/|J| and p(v|y=i) = p(z|y=i)/|J|, where |J| is the Jacobian determinant of g1. Therefore, according to Eq. 6 and 7 we have:

p(y=i|v) = p(y=i|z)   (8)
Again, for the locally isometric mapping g2 and the d-dimensional vector x = g2(v), we have:

p(y=i|x) = p(x|y=i) p(y=i) / Σ_j p(x|y=j) p(y=j)   (9)

As g2 is locally isometric, all probability density functions on v and x change proportionally, so p(x|y=i) and p(x) are scaled from p(v|y=i) and p(v) by the same local factor, which cancels in Bayes’ rule. Again, according to Eq. 7 and 9 we have

p(y=i|x) = p(y=i|v)   (10)
Thus, we have p(y=i|x) = p(y=i|v) = p(y=i|z) based on Eq. 8, 10, and 6, meaning we can make use of p(y|z) and g to infer p(y|x).
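The cancellation behind Eq. 8 can be checked numerically for a linear bijection v = Az, whose Jacobian determinant is constant and drops out of Bayes' rule. The mixture parameters and the matrix A below are illustrative choices:

```python
import numpy as np

def gauss2d(z, mu, cov):
    """2D Gaussian density N(z; mu, cov)."""
    diff = z - mu
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Two-class setup on the z-plane (illustrative parameters).
mus  = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.eye(2) * 0.5]
pis  = [0.4, 0.6]

A = np.array([[2.0, 1.0], [0.0, 1.0]])   # continuous bijection v = A z
z = np.array([1.0, 0.5])
v = A @ z                                 # the mapped point (for illustration)

# Posterior on the z-plane (Eq. 6).
num_z = [pi * gauss2d(z, mu, c) for pi, mu, c in zip(pis, mus, covs)]
post_z = np.array(num_z) / sum(num_z)

# Posterior on the v-plane (Eq. 7): by change of variables, the densities at
# v transform as p(v|y) = p(z|y)/|det A|, and |det A| cancels in Bayes' rule.
detA = abs(np.linalg.det(A))
num_v = [pi * gauss2d(z, mu, c) / detA for pi, mu, c in zip(pis, mus, covs)]
post_v = np.array(num_v) / sum(num_v)
```

The two posteriors agree exactly, which is the invariance that Eq. 8 states for a general continuous bijection.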
1D Paths for the High-Dimensional Case: To illustrate in more detail how density and sparsity correlate with prediction precision, we can select any path on the 2D plane for the high-dimensional case and show the digit-wise prediction, density, sparsity, and prediction precision along each path. Fig. 10 shows three example paths on the 2D digit plane. Fig. 11 shows the statistics for Path C. From the second-to-last row in Fig. 11, we can see that the density usually correlates with the sparsity, although we would expect density and sparsity to be independent of each other. We observe this phenomenon along other paths as well and attribute it to the specific configuration of our 2D Gaussian mixture (in Section III-C, we explain why we only consider a specific configuration for the high-dimensional case). For 1D, when we conduct a systematic grid search in the parametric space, we remove this unwanted correlation between density and sparsity (see Fig. 4 (e), in which there is no correlation between density and sparsity).
References

[1] (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980.
[2] (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
[3] (1994) Mixture density networks. Technical Report, Aston University, Birmingham.
[4] (2019) Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform, pp. 59–64.
[5] (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
[6] (1992) Practical Markov chain Monte Carlo. Statistical Science, pp. 473–483.
[7] (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661.
[8] (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
[9] (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869.
[10] (2009) Comparing measures of sparsity. IEEE Transactions on Information Theory 55 (10), pp. 4723–4741.
[11] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
[12] (2020) Hands-on Bayesian neural networks – a tutorial for deep learning users. arXiv preprint arXiv:2007.06823.
[13] (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
[14] (2016) Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934.
[15] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[16] (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
[17] (2020) A universal approximation theorem of deep neural networks for expressing distributions. arXiv preprint arXiv:2004.08867.
[18] (1995) Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A 354 (1), pp. 73–80.
[19] (1999) Neural networks for density estimation.
[20] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[21] (2016) Approximate inference for deep latent Gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, Vol. 2, pp. 131.
[22] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[23] (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
[24] (1990) Probabilistic neural networks. Neural Networks 3 (1), pp. 109–118.
[25] (2016) Bayesian optimization with robust Bayesian neural networks. Advances in Neural Information Processing Systems 29, pp. 4134–4142.
[26] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11).
[27] (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.
[28] (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915.
[29] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.