Significant effort has gone into identifying and quantifying the uncertainties associated with machine learning and data science procedures (Gal and Ghahramani, 2016; Kendall and Gal, 2017; Srivastava et al., 2015). However, recent machine learning models such as deep neural networks remain unsatisfactory at quantifying uncertainty and tend to produce overconfident predictions (Lakshminarayanan et al., 2017). This article proposes a model-based framework for quantifying uncertainty in the inference of parameters. Uncertainty quantification (UQ) aims to provide a paradigm in which uncertainty can be reviewed and, ideally, assessed in a manner relevant to researchers using the predictive models. We do this by searching for a UQ distribution of the parameters of interest that minimizes a distance between the predictive distribution based on the UQ distribution and the empirical distribution of the observed data. Because directly evaluating the UQ distribution (e.g., its density function) is computationally challenging, we instead consider a parameter generator that produces parameter samples following the minimizing UQ distribution. We therefore call the proposed procedure the predictive-matching Generative Parameter Sampler (GPS). It is implemented by a stochastic optimization algorithm, so the computation scales to large data sets. Our main contributions are summarized as follows:
- We propose a new framework that quantifies the uncertainty of models and parameters by predictive matching with the observed data. The UQ based on the GPS is predictive-optimal in the sense that the resulting predictive distribution (1), based on the UQ distribution (defined in (2)), is as close as possible to the empirical distribution of the observed data.
- We show that the GPS induces a fully independent structure based on individual parameterization (Figure 1 (a)), in contrast to the global parameterization of standard Bayesian (or frequentist) models (Figure 1 (b)). We report that this individual parameterization is more robust to outliers and can capture features that standard methods cannot detect (e.g., the ‘scissors’ example in Figure 2).
- We demonstrate that the computation of the GPS is scalable, because the parameter generator can be trained by standard stochastic optimization methods.
2 Predictive Matching Generative Parameter Sampler
The main idea of the GPS builds on the notion of an Integral Probability Metric (IPM) (Müller, 1997), which is used to measure a distance between the predictive distribution and its empirical counterpart. Examples of IPMs include the Wasserstein distance (Villani, 2008), the energy distance (Székely and Rizzo, 2013), and the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012). For computational convenience, we use the MMD as the default distance in this article.
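For reference, the standard definitions behind this choice can be written as follows. The symbols $P$, $Q$, $\mathcal{F}$, and $k$ are generic notation introduced here for illustration only and are not necessarily the notation used elsewhere in this article. An IPM compares two distributions through the largest gap in expectations over a function class, and the MMD is the special case in which that class is the unit ball of a reproducing kernel Hilbert space with kernel $k$ (Gretton et al., 2012):

```latex
\[
d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}}
  \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right|,
\]
\[
\mathrm{MMD}^2(P, Q) = \mathbb{E}\, k(X, X') - 2\, \mathbb{E}\, k(X, Y) + \mathbb{E}\, k(Y, Y'),
\qquad X, X' \sim P, \quad Y, Y' \sim Q \ \text{independently.}
\]
```

The second expression involves only kernel evaluations at pairs of samples, which is what makes a sample-based approximation of the distance straightforward.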
We posit that the observations are generated independently, each from a density function with its own parameter. We define a noise random variable that can be easily generated and pass it through a function to generate a parameter. We call this function the generator. The generator induces a density function on the parameter, and the predictive distribution of an observation (1) is obtained by averaging the data density over this induced parameter density.
We model the generator by a neural network. The optimal generator is defined as follows.
The optimal generator of the GPS is the minimizer, over the parameters of the neural network, of the MMD between the predictive distribution and the empirical distribution of the observed data (2).
The UQ distribution is then defined as the distribution of the parameters obtained by passing the noise random variable through the optimal generator.
For convenience, we consider the negation of MMD here. Once the generator is trained, parameters can be generated from the UQ distribution by the following procedure: i) sample a large number of independent noise variables; ii) evaluate the trained generator at each of them. We then use the sampled parameters to derive characteristics of the UQ distribution, such as the mean or a marginal 95% uncertainty interval. UQ for prediction can also be implemented easily by sampling an observation from the data density at each sampled parameter, and a prediction interval can be evaluated from the empirical distribution of the sampled predictions.
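The sampling procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_generator` is a hypothetical stand-in for a trained generator, the noise is assumed to be standard Gaussian, the parameter is a scalar, and the data model is assumed to be Gaussian with unit standard deviation. None of these choices come from the article.

```python
import random
import statistics

def toy_generator(z):
    """Hypothetical stand-in for a trained generator: maps noise to a scalar parameter."""
    return 2.0 + 0.5 * z

def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 1]) of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return (1.0 - weight) * sorted_values[lower] + weight * sorted_values[upper]

rng = random.Random(0)
num_draws = 5000

# i) sample the noise variables; ii) evaluate the generator at each of them
thetas = sorted(toy_generator(rng.gauss(0.0, 1.0)) for _ in range(num_draws))

# Characteristics of the sampled parameter distribution
uq_mean = statistics.fmean(thetas)
uq_interval = (percentile(thetas, 0.025), percentile(thetas, 0.975))

# UQ for prediction: draw one observation from the (assumed Gaussian) data model
# at each sampled parameter, then read off an empirical interval.
predictions = sorted(rng.gauss(theta, 1.0) for theta in thetas)
prediction_interval = (percentile(predictions, 0.025), percentile(predictions, 0.975))
```

The prediction interval is wider than the parameter interval because it combines the spread of the sampled parameters with the noise of the data model.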
Remark. The core of Bayesian procedures emphasizes the posterior that is defined as a conditional distribution of the parameter given the observations. On the other hand, while our parameter is modeled by a distribution (as in the Bayesian setting), the UQ distribution of the GPS does not follow a conditional law, which distinguishes the GPS from the standard Bayesian framework.
Individual parameterization for GPS. Individual parameterization means that each observation is associated with its own parameter (see Figure 1(a)). The individual parameterization in the GPS assumes that there exists a true parameter-generating law for the parameter of each observation. This setup can be considered a slight generalization of the global parameterization (Figure 1(b)), because placing a point mass on a fixed parameter value recovers a standard frequentist model.
The main reason for adopting the individual parameterization is to exploit scalable computation under the fully independent structure among the observations. This means that any set of generated predictive samples can be used to approximate the MMD during optimization. In contrast, under the classical setup in Figure 1(b), computing the distance in the GPS requires generating a full set of predictive samples, as many as there are observations, to capture the conditioning on the whole data set. Such standard Bayesian models are computationally demanding for large data sets.
The individual parameterization has not been commonly used in practice. In Bayesian statistics, Dirichlet Process Mixture (DPM) models follow the individual parameterization by imposing a DPM prior on the parameters (Teh et al., 2005; Teh, 2010; MacEachern and Müller, 1998), but MCMC computation of the posterior distribution is extremely demanding, especially for large data sets, because at every iteration the MCMC algorithm must sample one parameter per observation. Moreover, the posterior distribution of a DPM model is discrete by nature. While this discreteness is helpful in clustering problems where the number of clusters is not specified a priori, it hinders the practical use of the DPM in more general settings where the posterior space is continuous. For these reasons, we do not examine DPM models further in the sequel.
In a frequentist paradigm, the individual parameterization is not considered because the true parameter is assumed to be fixed, without any randomness. Although some hierarchical models, such as random-effects models, are used in practice, their main focus is not on estimating the distribution of the individual parameters but on reducing random effects that may affect the estimation of the parameters of interest.
The well-known Gaussian sequence model also adopts the individual parameterization setup. However, this model is not designed to analyze real data sets; it is mainly used to investigate theoretical properties of various statistical procedures. These include risk minimization (Stein, 1981; Johnstone et al., 2004; Castillo et al., 2012), high-dimensional variable selection (Castillo et al., 2015; Ročková and others, 2018), and nonparametric function estimation (Johnstone, 1999). For this reason, we do not discuss the Gaussian sequence model further either.
In Section 3, we present a thorough investigation of the advantages of the individual parameterization setup.
Objective function and computation. Using the definition of the MMD (Gretton et al., 2012), the objective function (3) for the GPS is expressed through expectations of a positive-definite kernel evaluated at pairs of predictive samples and observations.
In this article, we use a Gaussian kernel as the canonical choice, although any valid kernel may be used in practice. Algorithm 1 describes a Stochastic Gradient Descent (SGD) algorithm for minimizing the objective function (3). The main idea of this algorithm is to approximate the expectations in (3) by a Monte Carlo (MC) procedure: at every iteration, the MC approximation is formed by sampling different parameters together with their predictive samples. The number of samples used for the MC approximation can be small in practice, and we show in Section 4 that such a setting achieves strong empirical performance. Although Algorithm 1 adopts SGD, other stochastic optimization procedures, such as Adagrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014), can be used to accelerate convergence.
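The Monte Carlo step described above can be sketched as follows. This is not Algorithm 1 itself: it only evaluates a mini-batch estimate of the squared MMD with a Gaussian kernel, and the gradient update (which in practice would come from an automatic-differentiation library) is omitted. The scalar Gaussian data model, the standard Gaussian noise, the batch sizes, and all function names are assumptions made for illustration.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two scalar samples."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)

def minibatch_objective(data, generator, sample_observation, rng,
                        batch_size=50, num_draws=50):
    """One Monte Carlo evaluation of a predictive-matching objective."""
    y_batch = rng.sample(data, batch_size)                       # mini-batch of observations
    thetas = [generator(rng.gauss(0.0, 1.0)) for _ in range(num_draws)]  # sampled parameters
    y_pred = [sample_observation(theta, rng) for theta in thetas]        # predictive samples
    return mmd2(y_pred, y_batch)

def sample_observation(theta, rng):
    """Assumed data model: Gaussian around the individual parameter."""
    return rng.gauss(theta, 0.1)

rng = random.Random(1)
# Toy data: each observation has its own parameter drawn from a standard Gaussian.
data = [sample_observation(rng.gauss(0.0, 1.0), rng) for _ in range(500)]

loss_matched = minibatch_objective(data, lambda z: z, sample_observation, rng)
loss_shifted = minibatch_objective(data, lambda z: z + 5.0, sample_observation, rng)
```

A generator whose predictive samples resemble the data yields a much smaller value than one whose samples are shifted away from it, which is the signal a stochastic optimizer would follow.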
3.1 Linear regression.
We first consider a simple linear regression in which the predictor is generated as i.i.d. standard Gaussian. In the true data-generating process, the slope parameter takes one of two values depending on the observation.
The resulting scatter plot of a synthetic data set is illustrated in Figure 2 (left). In this example, the classical global parameterization cannot capture the ‘scissors’-like shape. As a consequence, Figure 2 (middle) shows that the standard posterior distribution of the slope under a uniform prior is concentrated at zero, which differs completely from the truth. In contrast, Figure 2 (right) shows that the UQ distribution derived from the GPS captures the bimodal shape of the true distribution of the slope parameter. This example shows that, compared with standard procedures, the GPS covers a wider range of true data-generating processes.
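To see why a single global slope collapses toward zero on such data, consider the following sketch. The slopes of +1 and -1, the noise level, and the sample size are hypothetical values chosen for illustration, not the settings of the experiment. In a Gaussian linear model with a flat prior, the posterior of a global slope is centered at the least-squares estimate, so the least-squares fit is used here as a proxy.

```python
import random

rng = random.Random(0)
n = 2000

# Hypothetical 'scissors' data: the first half follows slope +1, the second half slope -1.
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
slopes = [1.0 if i < n // 2 else -1.0 for i in range(n)]
ys = [s * x + rng.gauss(0.0, 0.1) for s, x in zip(slopes, xs)]

def least_squares_slope(x_values, y_values):
    """Slope of a no-intercept least-squares fit."""
    return sum(x * y for x, y in zip(x_values, y_values)) / sum(x * x for x in x_values)

# One slope shared by all observations (global parameterization)
global_slope = least_squares_slope(xs, ys)

# Slopes fitted separately to each blade of the scissors
upper_slope = least_squares_slope(xs[: n // 2], ys[: n // 2])
lower_slope = least_squares_slope(xs[n // 2 :], ys[n // 2 :])
```

The shared slope averages the two blades and lands near zero, whereas the separate fits recover the two underlying values. A distribution over individual slopes can represent both modes at once.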
To examine the robustness of the GPS to outliers, we consider the following example. We generate i.i.d. responses from a linear model. We then randomly select 5% of the responses to be outliers and contaminate them with extra noise generated from a Cauchy distribution.
We compare the UQ distribution to the Bayesian posterior distribution (with a uniform prior) in the top row of Figure 3. While the outliers distort the posterior, the UQ distribution of the GPS remains reasonably concentrated around the true parameters. This shows that the GPS is more robust to outliers than the standard Bayesian procedure in this simple example.
These results show that the GPS is robust to outliers and is able to recover non-standard features (the ‘scissors’ shape). These desirable properties stem from a characteristic of the individual parameterization: each observation is affected only by its own parameter, without interfering with the other parameters. This explains why the GPS is robust to outliers. Even if a small number of outliers exist, under the individual parameterization their effect on the individual parameters of the non-outliers is minimal.
We also consider a classical scenario in which the assumptions of the linear model are fully satisfied. We use the same setting as in the previous ‘outlier’ example, but without outliers. The results are illustrated in the bottom row of Figure 3. The UQ distribution behaves similarly to the Bayesian posterior distribution.
3.2 Poisson processes with random intensity.
We consider an inhomogeneous Poisson process on a spatial domain, parameterized by an intensity function (Kingman, 1993). The random number of events in any region of the domain is a Poisson random variable whose parameter is the integral of the intensity function over that region.
The GPS is tested on a synthetic data set generated from a known true intensity function, which is illustrated as a heatmap in the third plot of Figure 4. To embed this in our GPS framework, we train a generator that modulates the intensity function. From the intensity learned through the GPS, we generate points via thinning (Lewis and Shedler, 1978). The first plot of Figure 4 illustrates one path of the simulated counts compared with the realized observations. The fourth plot depicts the mean intensity over samples drawn from the GPS. Overall, this confirms that the GPS operates sensibly, in that it is able to recover the ground truth.
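The thinning step can be sketched as follows for a rectangular domain. The intensity function `toy_intensity`, its upper bound, and the unit-square domain are hypothetical choices for illustration. The method requires an upper bound on the intensity over the domain: candidate points are drawn from a homogeneous process at that bound and each is kept with probability equal to the ratio of the local intensity to the bound.

```python
import random

def sample_poisson(mean, rng):
    """Draw a Poisson count by counting unit-rate exponential arrivals before `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def thinning(intensity, intensity_max, rng, width=1.0, height=1.0):
    """Simulate an inhomogeneous Poisson process on [0, width] x [0, height] by thinning."""
    # Candidates from a homogeneous process with rate intensity_max
    num_candidates = sample_poisson(intensity_max * width * height, rng)
    points = []
    for _ in range(num_candidates):
        x, y = rng.uniform(0.0, width), rng.uniform(0.0, height)
        # Keep the candidate with probability intensity(x, y) / intensity_max
        if rng.random() * intensity_max < intensity(x, y):
            points.append((x, y))
    return points

def toy_intensity(x, y):
    """Hypothetical intensity: 200 on the left half of the unit square, 0 on the right half."""
    return 200.0 if x < 0.5 else 0.0

rng = random.Random(0)
points = thinning(toy_intensity, 200.0, rng)
```

With this toy intensity every retained point falls in the left half of the square, and the expected number of points equals the integral of the intensity over the domain (here 100).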
3.3 Deep neural network for classification
In image classification, Deep Neural Network (DNN) models with convolutional layers have been widely used (Krizhevsky et al., 2012), yet quantifying the uncertainty of the predicted class remains a challenging task. To address this problem, we apply the GPS to a multi-class classification model in which each label is distributed over the classes with probabilities given by a classifier function evaluated at the corresponding predictor (or image). The classifier function is commonly modeled by a DNN (e.g., a Convolutional Neural Network (CNN)) that takes the image as input and returns the class probabilities as output, and the cross-entropy loss can be used to train the DNN in standard procedures.
For this application, the generator of the GPS is a neural network that takes both the image and the noise random variable as input and returns the classification probabilities. The only difference between this GPS setting and a standard DNN structure is that the generator introduces randomness into the neural network by augmenting the input with the random noise variable, so that the resulting neural network is naturally random (see Figure 5). After the parameters of the generator are learned by optimizing the objective function in (3), the trained generator produces classification probabilities that capture the uncertainty in matching the predictions to the observations. The CNN based on the GPS can be trained by implementing Algorithm 1.
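The idea of augmenting the input with noise can be illustrated with a deliberately small example. The one-layer linear classifier, its random weights, and the dimensions below are hypothetical stand-ins for a trained deep network; the only point is to show where the noise enters and how repeated draws yield a distribution over classification probabilities for a single input.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_random_classifier(input_dim, noise_dim, num_classes, rng):
    """Hypothetical one-layer generator: class scores are linear in the augmented input [x, z]."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(input_dim + noise_dim)]
               for _ in range(num_classes)]

    def classify(x, z):
        augmented = list(x) + list(z)  # the noise is appended to the ordinary input
        scores = [sum(w * a for w, a in zip(row, augmented)) for row in weights]
        return softmax(scores)

    return classify

rng = random.Random(0)
input_dim, noise_dim, num_classes = 4, 2, 3
classify = make_random_classifier(input_dim, noise_dim, num_classes, rng)

x = [0.5, -1.0, 0.3, 2.0]
# Each noise draw gives one sample of the classification probabilities for the same input.
draws = [classify(x, [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]) for _ in range(200)]

# Averaging over the draws gives a single probability vector for point prediction.
mean_probabilities = [sum(d[k] for d in draws) / len(draws) for k in range(num_classes)]
```

Compared with a deterministic classifier, the only structural change is that the first layer sees `input_dim + noise_dim` inputs instead of `input_dim`.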
The GPS structure in Figure 5 also enjoys a computational advantage over other uncertainty quantification procedures based on variational inference, including inverse autoregressive flow (Kingma et al., 2016), non-linear independent components estimation (Dinh et al., 2014), and normalizing flows (Rezende and Mohamed, 2015). Like the GPS, these procedures use a generator driven by a noise random variable. Unlike the GPS, however, they approximate the Bayesian posterior distribution of the DNN (or CNN) parameters, which is usually extremely high-dimensional. Constructing a neural network that transforms random noise into such a high-dimensional output (the number of parameters in the target DNN) is computationally challenging. For example, when the target dimension is a few million, more than tens of billions of generator parameters may need to be trained. In contrast, the GPS models the classifier function directly by a DNN whose input is augmented with the random noise. The only difference from the deterministic DNN is that the input dimension increases by the dimension of the noise variable, which the user can control. In Section 4, we show in some examples that the performance of the generator is not sensitive to this dimension.
In addition to the problem of high-dimensionality, the variational inference procedures also require a strict restriction on the network structure of the generator. This is to ease the computation of the determinant of the resulting Jacobian term. This restriction sometimes may slow down the convergence of training, because it limits the flexibility of the neural network. In contrast, the GPS bypasses the calculation of the Jacobian term by adopting the MMD as the distance measure.
4 Experiments on MNIST and CIFAR10 Data Sets
Uncertain images in classification. We first describe what we mean by “Uncertain” images in the classification model based on the GPS. Intuitively, uncertain images are those whose classification probabilities under the UQ distribution fluctuate so widely that no single class dominates the others.
To carry out our experiment, we formally define uncertain images as follows. For each class and each image, we propose an Uncertainty Quantification Criterion (UQC), which is defined in terms of the classification probability of that class under the UQ distribution.
The value of the UQC can easily be approximated by generating classification probabilities from the trained UQ distribution. We call an image “Uncertain” if none of the UQC values for that image exceeds 50%. This means that the classification probabilities of an “Uncertain” image are not dominated by a single class. If an image is not “Uncertain”, we call it “Certain”. An example of “Uncertain” images in the MNIST data set is illustrated in Figure 5(a): the marginal distributions of the classification probabilities for digits “2”, “4”, and “6” are too dispersed for any class to satisfy the condition of certainty.
The categorization of “Uncertain” images is practically useful because it indicates whether the model is certain about its prediction for a given input. This uncertainty quantification procedure enables us to avoid overconfident decisions when the prediction is highly uncertain. In the following, we show that the classification error rate can be dramatically reduced by setting aside “Uncertain” images.
Classification under uncertainty quantification. Here we use a CNN (Krizhevsky et al., 2012) for the MNIST data set and a ResNet (He et al., 2016) for the CIFAR10 data set (the detailed settings are listed in the Appendix). For all procedures, the number of training epochs is fixed separately for MNIST and for CIFAR10, and the same mini-batch size is used for both data sets. The dimension of the noise variable is set relative to the original input dimension of the feed-forward network. Classification with the GPS uses the mean of the classification probabilities under the UQ distribution. To train the neural networks, we use Algorithm 1; the PyTorch code is available at https://github.com/minsuk000/GPS.
To examine the quality of the uncertainty quantification by the GPS, we compare the GPS to Monte Carlo dropout (Gal and Ghahramani, 2016) (MCDrop for short). The MCDrop procedure applies dropout (Srivastava et al., 2014) during training of the neural network. The output of the trained network is then evaluated while randomly dropping nodes in the network, so the resulting output is random. Gal and Ghahramani (2016) showed that MCDrop approximates the variational predictive posterior distribution that minimizes the KL-divergence to a deep Gaussian process (Damianou and Lawrence, 2013). Because MCDrop is stochastic, we can evaluate the UQC for it as for the GPS, and “Uncertain” images can be determined by the UQC derived from MCDrop.
Table 1: | Method | Error Rate (Full) | Error Rate (w/o “Uncertain”) | % of “Uncertain” |
Table 2: | Method | % of “Uncertain” from unseen class | % of “Uncertain” from random noise |
In Table 1, we report the classification performance of the standard procedure and our GPS procedure. More precisely, we provide the classification error rate on the full test data set, the proportion of “Uncertain” images, and the error rate on the test data set excluding the “Uncertain” images. The GPS procedure achieves competitive classification performance on both the MNIST and CIFAR10 data sets. After discarding “Uncertain” images, the classification accuracy of the GPS procedure improves dramatically. For the MNIST data set in particular, fewer than 3% of the images are discarded, yet the error rate decreases by a factor of six. These results outperform the state-of-the-art error rate for the MNIST data set reported by Wan et al. (2013). This improvement indicates that the UQ derived by the GPS is capable of capturing uncertain situations and helping practitioners make better decisions. In contrast, even though the proportion of “Uncertain” images determined by MCDrop is slightly smaller than that determined by the GPS, the error rate on its “Certain” images is more than twice that of the GPS.
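The three quantities reported in Table 1 can be computed as in the following sketch. It takes the per-class UQC values as given and applies only the 50% rule stated earlier; how the UQC values themselves are obtained is not reproduced here. The function names and the five records are made up for illustration and are not experimental results.

```python
def is_uncertain(uqc_values, threshold=0.5):
    """An image is "Uncertain" when no class criterion exceeds the threshold."""
    return all(value <= threshold for value in uqc_values)

def summarize(records):
    """records: list of (true_label, predicted_label, per-class UQC values)."""
    flags = [is_uncertain(uqc) for _, _, uqc in records]
    errors = [truth != pred for truth, pred, _ in records]
    # Errors restricted to the images that are not flagged ("Certain" images)
    certain_errors = [e for e, f in zip(errors, flags) if not f]
    return {
        "error_full": sum(errors) / len(records),
        "error_certain": (sum(certain_errors) / len(certain_errors)
                          if certain_errors else float("nan")),
        "pct_uncertain": 100.0 * sum(flags) / len(records),
    }

# Made-up records for illustration only.
records = [
    (0, 0, [0.98, 0.01, 0.00]),
    (1, 1, [0.02, 0.95, 0.01]),
    (2, 2, [0.00, 0.03, 0.90]),
    (1, 2, [0.10, 0.30, 0.40]),  # misclassified, and flagged "Uncertain"
    (0, 0, [0.45, 0.20, 0.10]),  # correct, but flagged "Uncertain"
]
summary = summarize(records)
```

In this toy case the only misclassified image is also flagged as “Uncertain”, so the error rate on the remaining “Certain” images drops to zero at the cost of setting aside two of the five images.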
Overconfident predictions on unseen classes are problematic and call for caution. Ideally, the GPS should report higher uncertainty when the test data differ significantly from the training data. To check whether the GPS has this desirable property, we run an additional experiment in which one class of images (digit “1” for MNIST and class “truck” for CIFAR10) is removed from training. We then compute the classification probabilities for the held-out images and for random-noise images generated from an i.i.d. uniform distribution.
Figure 5(b) presents an example of images of the digit “1” and their classification probabilities under the setting described in the previous paragraph. Because the digit “1” was not used in training, it is reasonable that no class has a classification probability close to one and that the resulting distribution is dispersed. The value of the UQC for each class is close to zero, and the classification probabilities of some classes are spread widely over their range.
Table 2 contains the proportion of “Uncertain” images from a class unseen during training for the MNIST and CIFAR10 data sets. In the MNIST example, the GPS procedure flags images that differ significantly from the training data as “Uncertain” with high probability, both for digit “1” and for random noise, whereas MCDrop flags fewer than 70% of such images. For CIFAR10, the detection rates of the GPS for class “truck” and for random noise are given in Table 2, whereas the rate of MCDrop is around 50%.
We proposed a general framework, called GPS, that is computationally scalable in quantifying uncertainty in estimating parameters. We showed that the GPS can be applied to a wide range of models, e.g., linear models, Poisson processes, and deep learning. With experiments carried out, we conclude that the GPS is successful in providing a model-based uncertainty quantification.
- Castillo et al.  Ismaël Castillo, Aad van der Vaart, et al. Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101, 2012.
- Castillo et al.  Ismaël Castillo, Johannes Schmidt-Hieber, Aad Van der Vaart, et al. Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018, 2015.
- Damianou and Lawrence  Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
- Dinh et al.  Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Duchi et al.  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Gal and Ghahramani  Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
- Gretton et al.  Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- Johnstone et al.  Iain M Johnstone, Bernard W Silverman, et al. Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004.
- Johnstone  Iain M Johnstone. Wavelets and the theory of non-parametric function estimation. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 357(1760):2475–2493, 1999.
- Kendall and Gal  Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma et al.  Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
- Kingman  John Frank Charles Kingman. Poisson Processes. Wiley Online Library, 1993.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Krizhevsky  Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, Master’s Thesis, Department of Computer Science, 2009.
- Lakshminarayanan et al.  Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, pages 6402–6413. 2017.
- The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Lewis and Shedler  Peter A.W. Lewis and Gerald S. Shedler. Simulation methods for poisson processes in nonstationary systems. WSC ’78. IEEE Press, 1978.
- MacEachern and Müller  Steven N MacEachern and Peter Müller. Estimating mixture of dirichlet process models. Journal of Computational and Graphical Statistics, 7(2):223–238, 1998.
- Müller  Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
- Rezende and Mohamed  Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pages 1530–1538, 2015.
- Ročková and others  Veronika Ročková et al. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics, 46(1):401–437, 2018.
- Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Srivastava et al.  Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. Wasp: Scalable bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920, 2015.
- Stein  Charles M Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.
- Székely and Rizzo  Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249–1272, 2013.
- Teh et al.  Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Sharing clusters among related groups: Hierarchical dirichlet processes. In Advances in neural information processing systems, pages 1385–1392, 2005.
- Teh  Yee Whye Teh. Dirichlet process. Encyclopedia of machine learning, pages 280–287, 2010.
- Villani  Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Wan et al.  Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning, pages 1058–1066, 2013.
Detailed Neural Net Settings. In this section, we provide the detailed settings of the deep convolutional neural networks used for the MNIST and CIFAR10 data sets in Section 4.
We adopt the following notation: conv denotes a convolutional layer, specified by its number of filters, filter size, stride, and pixel padding; max-pool denotes a max-pooling layer with a given stride; avg-pool denotes an average-pooling layer with a given stride; and BN denotes a batch normalization step (Ioffe and Szegedy, 2015). The ReLU function, max(0, x), is denoted by ReLU, and convres denotes a convolutional layer conv whose output is added to an identity function, as used in the ResNet (He et al., 2016).
The CNN used for the MNIST data set is conv-BN-ReLU-max-pool-conv-BN-ReLU-max-pool, and the flattened output of these convolutional layers is connected to a 6-layer feed-forward neural network. All layers in the feed-forward network have the same number of nodes, and each node is batch-normalized.
The ResNet used for the CIFAR10 data set is conv-BN-ReLU-convres-BN-ReLU-convres-BN-ReLU-convres-BN-ReLU-convres-BN-ReLU-convres-BN-ReLU-convres-BN-ReLU-convres-BN-ReLU, and the output of these convolutional layers is connected to a 6-layer feed-forward neural network. All layers in the feed-forward network have the same number of nodes, and each node is batch-normalized.