1 Introduction
Safety-critical applications of machine learning require calibrated predictive uncertainty, that is, uncertainty estimates that are neither over- nor underconfident. However, modern trends in deep neural network (DNN) architecture design and training, such as strong overparameterization, normalization, and regularization, have been shown to have negative effects on calibration, yielding overconfident models (Guo et al., 2017). A natural solution is the application of Bayesian inference to DNNs, as it provides a sound framework for making optimal predictions under uncertainty (MacKay, 1992; Ovadia et al., 2019; Wilson, 2020).
The successful use of DNNs for a variety of real-world problems and tasks (McKinney et al., 2020; Silver et al., 2016; Poplin et al., 2018) has shown that the application of deep learning techniques to humanity's most important problems is mainly held back by a lack of usable data rather than by technological immaturity. This is particularly evident in the medical domain, where the application of machine learning has hitherto been limited by small, single-institutional and often non-representative datasets, owing to the sensitive nature of medical data and the strict regulation governing its use. As a result, even large published studies (McKinney et al., 2020) have utilized datasets that are an order of magnitude smaller than commonly used computer vision datasets such as ImageNet (Deng et al., 2009). Overconfident models trained on small, non-representative datasets are especially undesirable in domains such as medicine, where, contrary to generic computer vision tasks in which the quality of predictions is typically assessed in aggregate over a test set, decisions have direct, individual consequences for affected patients.
Two mutually complementary solutions can be employed to address this issue: (1) privacy-enhancing and federated learning techniques allow drawing conclusions from data to which direct access is not possible, increasing the effective dataset size and diversity; (2) Bayesian inference applied to neural networks allows for accurate uncertainty quantification and optimal decision making under uncertainty, resulting in better calibrated models.
In this work, we exploit the similarities between stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), a highly scalable stochastic gradient Markov chain Monte Carlo (SG-MCMC) method, and differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) to provide formal privacy guarantees while enabling more reliable and accurate uncertainty estimation.
1.1 Main contributions

- We show that DP-SGD has a negative impact on model calibration compared to standard SGD, making its deployment in safety-critical applications such as automated medical diagnosis problematic.

- We highlight and exploit similarities between DP-SGD and SGLD to reformulate DP-SGD as temperature-scaled SGLD. This novel reformulation offers more flexibility for achieving optimal privacy-utility trade-offs by removing the need for a direct, 1:1 coupling of the learning rate and the noise scaling parameter.

- We provide empirical evidence that DP-SGLD provides substantially better calibrated predictive uncertainty than standard DP-SGD.
2 Related Work
Prior work has established the natural link between differentially private training of neural networks and stochastic gradient Langevin dynamics (Wang et al., 2015; Li et al., 2019). However, neither takes into account the effects of gradient clipping with regard to gradient bias. Moreover, Li et al. (2019) neglect step size decay when training neural networks, a formal convergence requirement of SGLD that is often disregarded in practice. Guo et al. (2017) showed that modern, very deep neural networks often provide miscalibrated uncertainty estimates and proposed the expected calibration error (ECE) as an empirical, approximate measure of miscalibration. ECE uses the maximum probability of the softmax output as a measure of model confidence to approximate the calibration error; more recent work (Nixon et al., 2019) identifies issues with this approach and introduces the static calibration error (SCE), which measures the calibration of all classes, and the adaptive calibration error (ACE), which picks the bins for the approximation adaptively so that they contain similar numbers of samples. Post-hoc (re)calibration methods (Platt et al., 1999; Guo et al., 2017) have been shown to significantly improve calibration without hindering performance. They, however, rely on the assumption that the validation set used for recalibration is fully representative of the target distribution. This assumption may be problematic, as some distributional shift between the validation set and the target distribution is to be expected, as supported by empirical evidence (Ovadia et al., 2019). To the best of our knowledge, no prior work has analysed the benefits of utilising Langevin dynamics for private learning over standard DP-SGD with respect to model calibration.
3 Background
3.1 Differential privacy and differentially private deep learning
Differential privacy (DP) (Dwork et al., 2006, 2014) is an information-theoretic privacy definition providing an upper bound on the information gain from observing the output of an algorithm applied to a dataset.
Definition 1.
For a randomised algorithm $\mathcal{M}$, all subsets $S$ of its image, a sensitive dataset $D$ and its neighbouring dataset $D'$, we say that $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for a (typically small) constant $\varepsilon$ and $\delta \geq 0$:

$\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$   (1)

Here the neighbouring datasets $D$ and $D'$ differ by at most one record.
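As a concrete, hypothetical illustration of Definition 1 (not part of the original text), the classic randomized response mechanism on a single sensitive bit satisfies pure $\varepsilon$-DP with $\varepsilon = \ln 3$ and $\delta = 0$; a minimal sketch:

```python
import math

# Hypothetical illustration: randomized response on one sensitive bit.
# With probability 3/4 the true bit is reported, with probability 1/4
# the opposite bit; this mechanism is ln(3)-DP with delta = 0.
def rr_prob(true_bit: int, reported_bit: int) -> float:
    """Probability that randomized response reports `reported_bit`
    given the sensitive input `true_bit`."""
    return 0.75 if reported_bit == true_bit else 0.25

# Definition 1 holds for every output and both orderings of the
# neighbouring inputs 0 and 1:
eps = math.log(3)
for d, d_prime in [(0, 1), (1, 0)]:
    for s in [0, 1]:
        assert rr_prob(d, s) <= math.exp(eps) * rr_prob(d_prime, s) + 1e-9
```

Note that the bound is tight here: for $s$ equal to the true bit, the two sides of Eq. 1 coincide exactly.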
3.1.1 Differentially private deep learning
Definition 2.
Differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) applies differential privacy to deep learning training. In DP-SGD, minibatch gradients are privatized by clipping the per-sample gradients $g_t(x_i)$ to an $L_2$ norm threshold $C$, followed by the addition of independent Gaussian noise with standard deviation $\sigma C$:

$\bar{g}_t(x_i) = g_t(x_i) \,/\, \max\!\left(1, \lVert g_t(x_i) \rVert_2 / C\right)$   (2)

$\tilde{g}_t = \frac{1}{B}\left(\sum_{i} \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\right)$   (3)

Here $\sigma$ represents the noise multiplier and $B$ the batch size.
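To make the privatization step concrete, the following sketch (ours, assuming per-sample gradients are available as plain Python vectors; a real implementation would operate on framework tensors) clips each per-sample gradient, sums, noises and averages as in Eqs. 2 and 3:

```python
import math
import random

def privatize_gradients(per_sample_grads, clip_norm, noise_multiplier):
    """DP-SGD gradient privatization: clip each per-sample gradient to an
    L2 norm of at most `clip_norm` (Eq. 2), sum the clipped gradients, add
    Gaussian noise with standard deviation noise_multiplier * clip_norm,
    and average over the batch (Eq. 3)."""
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = 1.0 / max(1.0, norm / clip_norm)  # clipping factor of Eq. 2
        for j in range(dim):
            summed[j] += g[j] * scale
    return [(summed[j] + random.gauss(0.0, noise_multiplier * clip_norm)) / batch_size
            for j in range(dim)]
```

With `noise_multiplier = 0` this reduces to averaging the clipped gradients, which makes the clipping bias discussed next easy to inspect in isolation.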
Recent work (Chen et al., 2020) has shown that the clipping operation in Eq. 2 creates a geometric bias in the optimization trajectory of DP-SGD through the loss landscape. Chen et al. (2020) suggest adding Gaussian noise before clipping (referred to as pre-noising) and prove that this helps mitigate the geometric bias of the minibatch gradients.
3.2 Stochastic Gradient Langevin Dynamics (SGLD)
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) refers to a family of scalable Bayesian sampling algorithms that have recently (re)emerged in the context of training deep learning models on large datasets (Wenzel et al., 2020). In stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), Gaussian noise is added to the stochastic gradient descent update step. Through this addition of appropriately scaled noise proportional to the step size, the learning process converges to an MCMC chain from which it is possible to draw samples of the posterior distribution over the model parameters $\theta$. To guarantee convergence to this distribution, SGLD has two formal requirements: a decaying step size $\eta_t$ satisfying $\sum_{t=1}^{\infty} \eta_t = \infty$ and $\sum_{t=1}^{\infty} \eta_t^2 < \infty$.
To allow for differentially private learning without a direct and restrictive 1:1 coupling of the learning rate and the noise multiplier, we use a temperature-scaled reformulation of SGLD (Kim et al., 2020).
Definition 3 (Temperature-scaled SGLD).
Let $D = \{x_i\}_{i=1}^{N}$ be a set of independent and identically distributed samples from the data distribution. Let $p(\theta)$ represent the prior distribution, $\eta_t$ the step size at time $t$ and $U(\theta)$ the energy function of the posterior. The SGLD update rule is then given by:

$\theta_{t+1} = \theta_t - \eta_t \nabla \tilde{U}(\theta_t) + \mathcal{N}(0, 2 \eta_t \tau \mathbf{I})$   (4)

$U(\theta) = -\sum_{i=1}^{N} \log p(x_i \mid \theta) - \log p(\theta)$   (5)

where $\tau$ represents the temperature parameter and $\nabla \tilde{U}(\theta_t)$ a minibatch estimate of $\nabla U(\theta_t)$.
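The update rule is simple to sketch. In the toy example below (ours; a one-dimensional standard Gaussian "posterior" with energy $U(\theta) = \theta^2/2$, so the exact gradient stands in for the minibatch estimate), the chain at $\tau = 1$ draws samples that are approximately $\mathcal{N}(0, 1)$:

```python
import math
import random

def sgld_step(theta, grad_u, eta, tau):
    """One temperature-scaled SGLD update (Eq. 4): a gradient step on the
    energy function plus Gaussian noise with variance 2 * eta * tau."""
    return theta - eta * grad_u(theta) + random.gauss(0.0, math.sqrt(2.0 * eta * tau))

def sample_chain(n_steps=200_000, eta=0.01, tau=1.0, burn_in=1_000):
    """Run SGLD on U(theta) = theta^2 / 2 (so grad U = theta) and
    collect post-burn-in samples."""
    theta, samples = 0.0, []
    for t in range(n_steps):
        theta = sgld_step(theta, lambda th: th, eta, tau)
        if t >= burn_in:
            samples.append(theta)
    return samples
```

For a small, fixed $\eta$ the chain targets the posterior only approximately; the formal guarantee requires the decaying step size schedule discussed above.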
3.3 Model Calibration
We use the definition of Nixon et al. (2019) for model calibration. Consider a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of data example pairs, which we assume to be independent, identically distributed (i.i.d.) realizations of the random variables $(X, Y)$.
Definition 4.
A model which predicts a class $\hat{y}$ with probability (confidence) $\hat{p}$ is well calibrated if and only if $\hat{p}$ is always the true probability:

$\mathbb{P}(\hat{Y} = \hat{y} \mid \hat{P} = \hat{p}) = \hat{p}$   (6)

We define any difference between the left- and right-hand sides of Eq. 6 as the calibration error.
Intuitively, this means that predictions made by a well-calibrated model with 0.7 confidence should be correct 70% of the time.
3.3.1 Expected Calibration Error
The expected calibration error (ECE) (Naeini et al., 2015) partitions Eq. 6 into $M$ equally spaced confidence bins $B_m$ and calculates the weighted average of the difference between confidence (the maximum of the softmax output) and accuracy per bin; a lower ECE is better. ECE approximates the calibration error for the most likely class and is defined as:

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$   (7)
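A minimal implementation of this estimator (ours, operating on pre-computed top-class confidences and correctness indicators rather than raw softmax outputs) reads:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE (Eq. 7): partition predictions into `n_bins` equally spaced
    confidence bins and return the weighted average of the per-bin
    |accuracy - confidence| gap."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; the lowest bin also includes confidence 0
        members = [i for i, c in enumerate(confidences)
                   if lo < c <= hi or (b == 0 and c == 0.0)]
        if not members:
            continue
        acc = sum(correct[i] for i in members) / len(members)
        conf = sum(confidences[i] for i in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece
```

A perfectly calibrated model scores 0; an overconfident one scores the (weighted) average gap between its stated confidence and its empirical accuracy.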
4 Differentially private stochastic gradient descent with Langevin dynamics (DP-SGLD)
To adapt the DP-SGD algorithm to the Bayesian setting and allow for sampling from the posterior distribution, we replace the noise multiplier term $\sigma$ in DP-SGD with a temperature-scaled multiple of the learning rate: $\sigma = \sqrt{2 \eta_t \tau}$. The proposed changes are shown in Algorithm 1.
Note that, to satisfy the formal posterior convergence properties of SGLD, a decaying learning rate schedule is required.
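Since Algorithm 1 is not reproduced here, the following sketch (ours, under the same plain-vector assumption as before, and using $\sqrt{2\eta_t\tau}$ as our reading of the temperature-scaled noise coupling) shows the modified update: per-sample clipping is retained from DP-SGD, and the privatized gradient drives a Langevin step:

```python
import math
import random

def dp_sgld_step(theta, per_sample_grads, eta, tau, clip_norm):
    """One DP-SGLD update: per-sample clipping exactly as in DP-SGD, with
    the noise multiplier sigma replaced by sqrt(2 * eta * tau), so the
    privacy noise doubles as the Langevin noise."""
    batch_size = len(per_sample_grads)
    summed = [0.0] * len(theta)
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = 1.0 / max(1.0, norm / clip_norm)
        for j in range(len(theta)):
            summed[j] += g[j] * scale
    sigma = math.sqrt(2.0 * eta * tau)  # replaces the DP-SGD noise multiplier
    noisy_grad = [(summed[j] + random.gauss(0.0, sigma * clip_norm)) / batch_size
                  for j in range(len(theta))]
    return [theta[j] - eta * noisy_grad[j] for j in range(len(theta))]
```

Setting the temperature to zero recovers deterministic clipped SGD, which is a convenient sanity check for the implementation.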
4.1 Privacy Analysis
As Algorithm 1 is equivalent to DP-SGD up to the replacement of the noise multiplier parameter by the learning rate and temperature in the scaling of the Gaussian noise standard deviation, privacy accounting as originally proposed by Abadi et al. (2016) remains unchanged, although we use Gaussian differential privacy (GDP) (Dong et al., 2019), which offers a tighter bound on the privacy loss. Formally, the noise added to the stochastic gradient update step after clipping is distributed in DP-SGLD as:

$\xi_{\text{DP-SGLD}} \sim \mathcal{N}(0, 2 \eta_t \tau C^2 \mathbf{I})$   (8)

and in DP-SGD as:

$\xi_{\text{DP-SGD}} \sim \mathcal{N}(0, \sigma^2 C^2 \mathbf{I})$   (9)

Thus it follows that $\xi_{\text{DP-SGLD}}$ and $\xi_{\text{DP-SGD}}$ are equal in distribution if $\sigma = \sqrt{2 \eta_t \tau}$.
4.1.1 Privacy accounting with decaying noise
To perform privacy accounting with a decaying noise schedule, we use the n-fold composition theorem of Gaussian differential privacy (Dong et al., 2019): for a sequence of $\mu_i$-GDP mechanisms composed over the dataset, the resulting mechanism is $\sqrt{\sum_i \mu_i^2}$-GDP.
As a result, we can simply account for every optimization step with a unique noise addition separately and aggregate the resulting $\mu_i$ values to calculate the final GDP privacy guarantee, which can then be converted to an $(\varepsilon, \delta)$-DP guarantee.
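The accounting described above can be sketched as follows (our illustration; the conversion from $\mu$-GDP to $(\varepsilon, \delta)$-DP uses the duality stated in Dong et al. (2019), and computing each per-step $\mu_i$ from the subsampled noise level is handled by the accountant and omitted here):

```python
import math

def standard_normal_cdf(x):
    """CDF of the standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def compose_gdp(mus):
    """n-fold GDP composition: mu_i-GDP mechanisms compose to
    sqrt(sum_i mu_i^2)-GDP, so steps with unique noise additions can be
    accounted for individually and aggregated."""
    return math.sqrt(sum(mu * mu for mu in mus))

def gdp_to_delta(mu, eps):
    """Smallest delta such that a mu-GDP mechanism is (eps, delta)-DP,
    per the GDP-to-DP conversion of Dong et al. (2019)."""
    return (standard_normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * standard_normal_cdf(-eps / mu - mu / 2.0))
```

For example, composing two mechanisms with $\mu_1 = 3$ and $\mu_2 = 4$ yields a $5$-GDP mechanism, whose $\delta$ at any target $\varepsilon$ follows from `gdp_to_delta`.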
5 Experiments
We trained and tested our algorithm on two datasets (official train-test splits): MNIST (LeCun et al., 1998) and the Pediatric Pneumonia dataset (Kermany et al., 2018). We report the performance and calibration metrics for DP-SGD, SGD and DP-SGLD (our algorithm). We note that the aim of our experiments was not to achieve new state-of-the-art results but to highlight calibration differences between DP-SGD and our proposed method. We report ECE as the only quantitative calibration metric, as differences between ECE, SCE and ACE were negligible. For a visualisation of model calibration, see the calibration curves in Fig. 1.
5.1 Results on MNIST
Results for a five-layer convolutional neural network (CNN) trained on MNIST with the compared optimization procedures are reported. Once the privacy budget ($\varepsilon$ at fixed $\delta$) was exhausted, training was halted.
5.2 Results on Pediatric Pneumonia Dataset
Results for a pretrained, frozen backbone (EfficientNet-B1 (Tan and Le, 2019), pretrained on ImageNet) with two trainable dense layers on the Pediatric Pneumonia dataset (PPD) (Kermany et al., 2018) are shown in Table 2. PPD is a multi-class classification dataset comprising chest radiographs (classes: bacterial pneumonia, viral pneumonia and normal). Once the privacy budget ($\varepsilon$ at fixed $\delta$) was exhausted, training was halted.
6 Discussion
The proposed method provides substantially better calibrated predictions than standard DP-SGD, with a multi-fold reduction in ECE on both MNIST and PPD. Furthermore, DP-SGLD provides uncertainty estimates with improved calibration across the whole $[0, 1]$ model confidence interval, as shown in Fig. 1 (A & B). The softmax output of DP-SGD is concentrated around the extreme values of the interval, indicating overconfidence on the Pediatric Pneumonia dataset (Fig. 1 C). In contrast, the predicted probabilities of SGD and DP-SGLD are more homogeneously distributed.
Our work is not without limitations: DP-SGLD provides only point estimates of the network's parameters, and thus sampling from the posterior once the privacy budget is exhausted is not possible. Furthermore, DP-SGLD converges to a single mode of the posterior distribution and is thus not capable of capturing multiple, diverse solutions (modes). Future work could explore other, more expressive probabilistic formulations of Bayesian neural networks, such as Deep Ensembles (Lakshminarayanan et al., 2016) or SWAG (Maddox et al., 2019), for differentially private training. Our method employs pre-noising, and we conjecture that this improves ergodicity in the posterior space, but we leave it to future work to explore the effect of pre-noising and/or the decaying learning rate schedule on model calibration.
7 Conclusion
We present DP-SGLD, an elegant reformulation of DP-SGD as Bayesian posterior inference, and show that our approach yields considerably better calibrated models.
References
Abadi et al. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
Chen et al. (2020). Understanding gradient clipping in private SGD: a geometric perspective. Advances in Neural Information Processing Systems 33.
Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Dong et al. (2019). Gaussian differential privacy. arXiv preprint arXiv:1905.02383.
Dwork et al. (2006). Our data, ourselves: privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486–503.
Dwork et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.
Guo et al. (2017). On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
Kermany et al. (2018). Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131.
Kim et al. (2020). Stochastic gradient Langevin dynamics algorithms with adaptive drifts. arXiv preprint arXiv:2009.09535.
Lakshminarayanan et al. (2016). Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474.
LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
Li et al. (2019). On connecting stochastic gradient MCMC and differential privacy. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 557–566.
MacKay (1992). A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3), pp. 448–472.
Maddox et al. (2019). A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems 32, pp. 13153–13164.
McKinney et al. (2020). International evaluation of an AI system for breast cancer screening. Nature 577 (7788), pp. 89–94.
Naeini et al. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.
Nixon et al. (2019). Measuring calibration in deep learning. In CVPR Workshops, Vol. 2.
Ovadia et al. (2019). Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530.
Platt et al. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10 (3), pp. 61–74.
Poplin et al. (2018). Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2 (3), pp. 158–164.
Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
Wang et al. (2015). Privacy for free: posterior sampling and stochastic gradient Monte Carlo. In International Conference on Machine Learning, pp. 2493–2502.
Welling and Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688.
Wenzel et al. (2020). How good is the Bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405.
Wilson (2020). The case for Bayesian deep learning. arXiv preprint arXiv:2001.10995.