Differentially private training of neural networks with Langevin dynamics for calibrated predictive uncertainty

by Moritz Knolle, et al.

We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models. This represents a serious issue for safety-critical applications, e.g. in medical diagnosis. We highlight and exploit parallels between stochastic gradient Langevin dynamics, a scalable Bayesian inference technique for training deep neural networks, and DP-SGD, in order to train differentially private, Bayesian neural networks with minor adjustments to the original (DP-SGD) algorithm. Our approach provides considerably more reliable uncertainty estimates than DP-SGD, as demonstrated empirically by a reduction in expected calibration error (MNIST ∼5-fold, Pediatric Pneumonia Dataset ∼2-fold).




1 Introduction

Safety-critical applications of machine learning require calibrated predictive uncertainty, that is, uncertainty estimates that are neither over- nor under-confident. However, modern trends in deep neural network (DNN) architecture design and training such as strong overparameterization, normalization, and regularization have been shown to have negative effects on calibration, yielding overconfident models (Guo et al., 2017). A natural solution is the application of Bayesian inference to DNNs, as it provides a sound framework for making optimal predictions under uncertainty (MacKay, 1992; Ovadia et al., 2019; Wilson, 2020).

The successful use of DNNs for a variety of real-world problems and tasks (McKinney et al., 2020; Silver et al., 2016; Poplin et al., 2018) has shown that the application of deep learning techniques to humanity’s most important problems is mainly held back by lack of usable data rather than by technological immaturity. This is particularly evident in the medical domain, where the application of machine learning has hitherto been limited by small, single-institutional and often non-representative datasets, due to the sensitive nature of medical data and strict regulation governing its use. As a result, even large published studies (McKinney et al., 2020) have utilized datasets that are an order of magnitude smaller than commonly used computer vision datasets such as ImageNet (Deng et al., 2009). Overconfident models trained on small, non-representative datasets are especially undesirable in domains such as medicine, where, contrary to generic computer vision tasks in which the quality of predictions is typically assessed in aggregate over a test set, decisions have direct, individual consequences for affected patients.

Two mutually complementary solutions can be employed to address the above-mentioned issue: (1) The utilisation of privacy-enhancing and federated learning techniques allows drawing conclusions from data to which direct access is not possible, increasing the effective dataset size and diversity. (2) The application of Bayesian inference to neural networks allows for accurate uncertainty quantification and optimal decision making under uncertainty, resulting in better calibrated models.

In this work, we exploit the similarities between stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), a highly scalable stochastic gradient Markov chain Monte Carlo (SG-MCMC) method, and differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) to provide formal privacy guarantees while enabling more reliable and accurate uncertainty estimation.

1.1 Main contributions

  • We show that DP-SGD has a negative impact on model calibration compared to standard SGD, making its deployment in safety-critical applications such as automated medical diagnosis problematic

  • We highlight and exploit similarities between DP-SGD and SGLD, to reformulate DP-SGD as a temperature-scaled SGLD. Our novel reformulation offers more flexibility for achieving optimal privacy-utility trade-offs, by removing the need for a direct, 1:1 coupling of learning rate and noise scaling parameter

  • We provide empirical evidence that DP-SGLD provides substantially better calibrated predictive uncertainty than standard DP-SGD

2 Related Work

Prior work has established the natural link between differentially private training of neural networks and stochastic gradient Langevin dynamics (Wang et al., 2015; Li et al., 2019). Neither, however, takes into account the effect of gradient clipping on gradient bias. Moreover, Li et al. (2019) neglect step size decay when training neural networks, a formal convergence requirement of SGLD which is often disregarded in practice. Guo et al. (2017) showed that modern, very deep neural networks often provide miscalibrated uncertainty estimates and propose expected calibration error (ECE) as an empirical, approximate measure of miscalibration. ECE uses the maximum probability of the softmax output as a measure of model confidence to approximate the calibration error. More recent work (Nixon et al., 2019) identifies issues with this approach and introduces static calibration error (SCE), which measures calibration across all classes, and adaptive calibration error (ACE), which picks the bins for the approximation adaptively so that each contains a similar number of samples. Post-hoc (re)calibration methods (Platt and others, 1999; Guo et al., 2017) have been shown to significantly improve calibration without hindering performance. They rely, however, on the assumption that the validation set used for recalibration is fully representative of the target distribution. This assumption may be problematic, as some distributional shift between validation set and target distribution is to be expected, as supported by empirical evidence (Ovadia et al., 2019).
So far, to the best of our knowledge, no prior work has analysed the benefits of utilising Langevin dynamics for private learning over standard DP-SGD with respect to model calibration.

3 Background

3.1 Differential privacy and differentially private deep learning

Differential privacy (Dwork et al., 2006, 2014) (DP) is an information-theoretic privacy definition providing an upper bound for the information gain from observing the output of an algorithm applied to a dataset.

Definition 1.

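For some randomised algorithm $\mathcal{M}$, all subsets $S$ of its image, sensitive dataset $D$ and its neighbouring dataset $D'$, we say that $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for a (typically small) constant $\varepsilon$ and $0 \leq \delta < 1$:

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta \tag{1}$$

Here neighbouring datasets $D$ and $D'$ differ by at most one record.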

3.1.1 Differentially private deep learning

Definition 2.

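Differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) applies differential privacy to the training of deep learning models. In DP-SGD, mini-batch gradients are privatized by clipping the per-sample gradients $g_t(x_i)$ to an $L_2$-norm threshold $C$, followed by addition of independent Gaussian noise with standard deviation $\sigma C$:

$$\tilde{g}_t = \frac{1}{L} \left( \sum_{i=1}^{L} \frac{g_t(x_i)}{\max\left(1, \lVert g_t(x_i) \rVert_2 / C\right)} + \mathcal{N}\left(0, \sigma^2 C^2 \mathbf{I}\right) \right) \tag{2}$$

Here $L$ is the mini-batch size and $\sigma$ represents the noise multiplier.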
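The clipping and noising step of Definition 2 can be illustrated with a minimal NumPy sketch. The function name `privatize_gradients` and its argument names are hypothetical, and the per-sample gradients are assumed to be already available as rows of an array; this is an illustration of the standard DP-SGD step of Abadi et al. (2016), not the authors' implementation.

```python
import numpy as np


def privatize_gradients(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient to L2 norm `clip_norm`, sum them,
    add Gaussian noise with std `noise_multiplier * clip_norm`, and
    average over the mini-batch."""
    grads = np.asarray(per_sample_grads, dtype=float)  # shape (L, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Factor 1 / max(1, ||g|| / C): gradients inside the C-ball are unchanged.
    clipped = grads / np.maximum(1.0, norms / clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
example_grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
private_grad = privatize_gradients(example_grads, clip_norm=1.0,
                                   noise_multiplier=1.0, rng=rng)
```

With `clip_norm=1.0`, the first gradient is rescaled to unit norm while the second is left untouched, so the sensitivity of the summed gradient to any single example is bounded by the clipping threshold before noise is added.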

Recent work (Chen et al., 2020) has shown that the clipping operation in Eq. 2 creates geometric bias in the optimization trajectory of the loss landscape for DP-SGD. Chen et al. (2020) suggest adding Gaussian noise before clipping (referred to as pre-noising) and prove that this helps to mitigate the geometric bias of the mini-batch gradients.

3.2 Stochastic Gradient Descent with Langevin Dynamics (SGLD)

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) is a family of scalable Bayesian sampling algorithms that have recently (re-)emerged in the context of training deep learning models on large datasets (Wenzel et al., 2020). In stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), Gaussian noise is added to the stochastic gradient descent update step. Through this addition of appropriately scaled noise proportional to the step size, the learning process converges to an MCMC chain and it is possible to draw samples from the posterior distribution over model parameters $p(\theta \mid D)$. To guarantee convergence to this distribution, SGLD has two formal requirements (Welling and Teh, 2011): a decaying step size $\eta_t$ with $\sum_{t=1}^{\infty} \eta_t^2 < \infty$, and $\sum_{t=1}^{\infty} \eta_t = \infty$.

To allow for differentially private learning without a direct and restrictive 1:1 coupling of the learning rate and noise multiplier, we use a temperature-scaled reformulation of SGLD (Kim et al., 2020).

Definition 3 (Temperature-scaled SGLD).

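Given $D = \{x_i\}_{i=1}^{N}$, a set of independent and identically distributed samples from the data distribution, let $p(\theta)$ represent the prior distribution, $\eta_t$ the step size at time $t$ and $U(\theta)$ the energy function of the posterior. The SGLD update rule is then given by:

$$\theta_{t+1} = \theta_t - \eta_t \nabla \tilde{U}(\theta_t) + \sqrt{2 \eta_t \tau}\, \mathcal{N}(0, \mathbf{I})$$

where $\tau$ represents the temperature parameter and $\nabla \tilde{U}(\theta_t)$ a mini-batch estimate of $\nabla U(\theta_t)$.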
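As a rough illustration of the temperature-scaled update in Definition 3, the following sketch runs SGLD on a toy one-dimensional energy $U(\theta) = \theta^2 / 2$. The function names, the toy energy and the polynomial step size schedule $a (b + t)^{-\gamma}$ with $\gamma \in (0.5, 1]$ (a common choice that satisfies the two step size conditions of SGLD) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def step_size(t, a=0.1, b=1.0, gamma=0.55):
    """Polynomially decaying step size a * (b + t) ** (-gamma)."""
    return a * (b + t) ** (-gamma)


def sgld_step(theta, grad_energy, eta, tau, rng):
    """One temperature-scaled SGLD update:
    theta - eta * grad + sqrt(2 * eta * tau) * N(0, I)."""
    noise = rng.normal(size=np.shape(theta))
    return theta - eta * grad_energy + np.sqrt(2.0 * eta * tau) * noise


def run_toy_chain(n_steps=1000, tau=1.0, seed=0):
    """Run SGLD on the toy energy U(theta) = theta^2 / 2 (gradient: theta)."""
    rng = np.random.default_rng(seed)
    theta = np.ones(1)
    samples = []
    for t in range(n_steps):
        theta = sgld_step(theta, theta, step_size(t), tau, rng)
        samples.append(theta[0])
    return np.array(samples)


samples = run_toy_chain()
```

Setting `tau=0` removes the noise term and recovers plain gradient descent, which makes the role of the temperature as a separate knob on the injected noise easy to see.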

3.3 Model Calibration

We use the definition of Nixon et al. (2019) for model calibration. Consider a dataset $D$ of data example pairs $(x, y)$, which we assume to be independent, identically distributed (i.i.d.) realizations of the random variables $X, Y \sim \mathbb{P}$.

Definition 4.

A model which predicts a class $y$ with probability $\hat{p}$ is well calibrated if and only if $\hat{p}$ is always the true probability:

$$\mathbb{P}(Y = y \mid \hat{p} = p) = p \tag{6}$$

We define any difference between the left and right hand side of the above as calibration error.

Intuitively, this means, that a prediction from a well-calibrated model with 0.7 confidence should be correct 70% of the time.

3.3.1 Expected Calibration Error

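The expected calibration error (ECE) (Naeini et al., 2015) approximates Eq. 6 by partitioning the predictions into $M$ equally spaced confidence bins $B_1, \dots, B_M$ and calculating the weighted average of the difference between confidence (maximum of softmax output) and accuracy in each bin; a lower ECE is better. ECE approximates the calibration error for the most likely class and is defined as follows:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where $n$ is the total number of samples, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and mean confidence of the predictions in bin $B_m$.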
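A minimal sketch of how ECE can be computed from softmax outputs is given below. The function name `expected_calibration_error`, the default of 10 bins and the half-open bin convention are assumptions made for this example; the paper does not state which binning settings were used.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average over equally spaced confidence bins of
    |accuracy - mean confidence|, using the top-class probability."""
    probs = np.asarray(probs, dtype=float)   # shape (n, num_classes)
    labels = np.asarray(labels)              # shape (n,)
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() equals |B_m| / n
    return ece


# Two predictions made with 0.9 confidence, only one of them correct.
overconfident = expected_calibration_error([[0.9, 0.1], [0.9, 0.1]], [0, 1])
```

In this toy case the single occupied bin has confidence 0.9 and accuracy 0.5, so the model is overconfident by 0.4 on every prediction it makes.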


4 Differentially private stochastic gradient descent with Langevin dynamics (DP-SGLD)

To adapt the DP-SGD algorithm to the Bayesian setting and allow for sampling from the posterior distribution, we replace the noise multiplier term in DP-SGD with a temperature-scaled multiple of the learning rate: . The proposed changes are shown in Algorithm 1.

  Input: Examples $\{x_1, \dots, x_N\}$, loss function $\mathcal{L}(\theta)$
  Params: temperature $\tau$, gradient clipping bound $C$, pre-noise scale, decaying learning rate $\eta_t$ and mini-batch size $L$
  for each training step $t$ do
     Sample mini-batch from training data with probability $L/N$
     for each example $x_i$ in the mini-batch do
        Compute per example gradient
        Pre-noise gradients
        Clip gradient
     end for
     Add noise
     Perform parameter update step
  end for
Algorithm 1 DP-SGLD Algorithm

Note that, to satisfy the formal posterior convergence properties of SGLD, a decaying learning rate schedule is required.

4.1 Privacy Analysis

As Algorithm 1 is equivalent to DP-SGD in the sense that the noise multiplier parameter is replaced by the learning rate and temperature to scale the standard deviation of the Gaussian noise, privacy accounting as originally proposed by Abadi et al. (2016) remains unchanged, although we use Gaussian differential privacy (Dong et al., 2019), which offers a tighter bound on the privacy loss. Formally, the noise added to the stochastic gradient update step after clipping in DP-SGLD is distributed as follows:


and in DP-SGD:


Thus it follows that and are equal if .

4.1.1 Privacy accounting with decaying noise

To perform privacy accounting with a decaying noise schedule we use the n-fold composition theorem of Gaussian differential privacy (Dong et al., 2019):

For a sequence of $n$ $\mu_i$-GDP mechanisms composed over the dataset, the resulting mechanism is $\sqrt{\mu_1^2 + \dots + \mu_n^2}$-GDP.

As a result, we can simply account for every optimization step with unique noise addition separately and aggregate the resulting $\mu_i$ values to calculate the final GDP privacy guarantee, which can then be converted to an $(\varepsilon, \delta)$-DP guarantee.
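The aggregation and conversion described above can be sketched in a few lines of standard-library Python. The helper names are hypothetical, and the per-step values passed to `compose_gdp` are placeholders: how each step's GDP parameter follows from the noise scale and sub-sampling is not covered here. The conversion uses the duality between $\mu$-GDP and $(\varepsilon, \delta)$-DP from Dong et al. (2019), $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon} \Phi(-\varepsilon/\mu - \mu/2)$.

```python
import math


def compose_gdp(mus):
    """n-fold composition: mu_i-GDP mechanisms compose to sqrt(sum mu_i^2)-GDP."""
    return math.sqrt(sum(mu * mu for mu in mus))


def _normal_cdf(x):
    # erfc keeps precision in the lower tail, where 1 + erf(x) would cancel.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gdp_to_delta(mu, eps):
    """delta(eps) such that a mu-GDP mechanism is (eps, delta)-DP."""
    return (_normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * _normal_cdf(-eps / mu - mu / 2.0))


def gdp_to_eps(mu, delta, eps_max=100.0, tol=1e-8):
    """Smallest eps (up to tol) with gdp_to_delta(mu, eps) <= delta, by bisection."""
    lower, upper = 0.0, eps_max
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if gdp_to_delta(mu, mid) > delta:
            lower = mid
        else:
            upper = mid
    return upper


# Placeholder per-step GDP parameters for a decaying noise schedule.
per_step_mus = [0.05, 0.06, 0.07, 0.08]
total_mu = compose_gdp(per_step_mus)
eps = gdp_to_eps(total_mu, delta=1e-5)
```

Because $\delta(\varepsilon)$ decreases monotonically in $\varepsilon$, a simple bisection is enough to invert it for a target $\delta$.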

5 Experiments

We trained and tested our algorithm on two datasets (official train-test splits): MNIST (LeCun et al., 1998) and the Pediatric Pneumonia dataset (Kermany et al., 2018). We report the performance and calibration metrics for DP-SGD, SGD and DP-SGLD (our algorithm). We note that the aim of our experiments was not to achieve new state-of-the-art results but to highlight calibration differences between DP-SGD and our proposed method. We report ECE as the only quantitative calibration metric as differences between ECE, SCE and ACE were negligible. For a visualisation of model calibration see the calibration curves in Fig. 1.

5.1 Results on MNIST

Results for a five-layer convolutional neural network (CNN) trained on MNIST with the compared optimization procedures are shown in Table 1. Once the privacy budget ($\varepsilon = 0.5$ at a fixed $\delta$) was exhausted, training was halted.

Figure 1: Calibration curves for MNIST (A) and Pediatric Pneumonia (B), alongside histogram plots of softmax output for Pediatric Pneumonia (C), comparing SGD (blue), DP-SGD (orange) and DP-SGLD (green). Confidence is the maximum value of the softmax output. Perfect calibration, defined as the equality of confidence and accuracy across the whole probability interval, is shown by the dashed line. Points below the dashed line represent underconfident predictions, while points above represent overconfident predictions.

Procedure        ε     Accuracy   AUC     ECE
SGD              -     0.984      0.999   0.0020
DP-SGD           0.5   0.967      0.996   0.0210
DP-SGLD (ours)   0.5   0.963      0.995   0.0044

Table 1: Performance and calibration metrics for SGD, DP-SGD and DP-SGLD on MNIST.

5.2 Results on Pediatric Pneumonia Dataset

Results for a pre-trained, frozen backbone (EfficientNet B1 (Tan and Le, 2019), ImageNet) with two trainable dense layers for the Pediatric Pneumonia dataset (PPD) (Kermany et al., 2018) are shown in Table 2. PPD is a multi-class classification dataset comprising chest radiographs (classes: bacterial pneumonia, viral pneumonia and normal). Once the privacy budget ($\varepsilon = 6.0$ at a fixed $\delta$) was exhausted, training was halted.

Procedure        ε     Accuracy   AUC     ECE
SGD              -     0.857      0.942   0.0526
DP-SGD           6.0   0.783      0.910   0.1790
DP-SGLD (ours)   6.0   0.786      0.920   0.0833

Table 2: Performance and calibration metrics for SGD, DP-SGD and DP-SGLD on the Pediatric Pneumonia dataset.

6 Discussion

The proposed method provides substantially better calibrated predictions (MNIST ∼5-fold, PPD ∼2-fold reduction in ECE) compared to standard DP-SGD. Furthermore, DP-SGLD provides uncertainty estimates with improved calibration across the [0, 1] model confidence interval, as shown in Fig. 1 (A & B). The softmax output of DP-SGD is concentrated about the extreme values of the interval, indicating overconfidence for the Pediatric Pneumonia dataset (Fig. 1 C). In contrast, the predicted probabilities of SGD and DP-SGLD were more homogeneously distributed.

Our work is not without limitations: DP-SGLD provides only point estimates of the network’s parameters, and thus sampling from the posterior once the privacy budget is exhausted is not possible. Furthermore, DP-SGLD converges onto a single mode in the posterior distribution, and is thus not capable of capturing multiple, diverse solutions (modes). Future work could explore other, more expressive probabilistic formulations of Bayesian neural networks such as deep ensembles (Lakshminarayanan et al., 2016) or SWAG (Maddox et al., 2019) for differentially private training. Our method employs pre-noising and we conjecture that this improves ergodicity in the posterior space, but leave it to future work to explore the effect of pre-noising and/or the decaying learning rate schedule on model calibration.

7 Conclusion

We present DP-SGLD, an elegant reformulation of DP-SGD as Bayesian posterior inference and show that our approach yields much better calibrated models.

References


  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318. Cited by: §1, §4.1, Definition 2.
  • X. Chen, S. Z. Wu, and M. Hong (2020) Understanding gradient clipping in private sgd: a geometric perspective. Advances in Neural Information Processing Systems 33. Cited by: §3.1.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • J. Dong, A. Roth, and W. J. Su (2019) Gaussian differential privacy. arXiv preprint arXiv:1905.02383. Cited by: §4.1.1, §4.1.
  • C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor (2006) Our data, ourselves: privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486–503. Cited by: §3.1.
  • C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy.. Foundations and Trends in Theoretical Computer Science 9 (3-4), pp. 211–407. Cited by: §3.1.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §1, §2.
  • D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, et al. (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131. Cited by: §5.2, §5.
  • S. Kim, Q. Song, and F. Liang (2020) Stochastic gradient langevin dynamics algorithms with adaptive drifts. arXiv preprint arXiv:2009.09535. Cited by: §3.2.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2016) Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474. Cited by: §6.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.
  • B. Li, C. Chen, H. Liu, and L. Carin (2019) On connecting stochastic gradient mcmc and differential privacy. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 557–566. Cited by: §2.
  • D. J. MacKay (1992) A practical bayesian framework for backpropagation networks. Neural computation 4 (3), pp. 448–472. Cited by: §1.
  • W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson (2019) A simple baseline for bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems 32, pp. 13153–13164. Cited by: §6.
  • S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. S. Corrado, A. Darzi, et al. (2020) International evaluation of an ai system for breast cancer screening. Nature 577 (7788), pp. 89–94. Cited by: §1.
  • M. P. Naeini, G. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29. Cited by: §3.3.1.
  • J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran (2019) Measuring calibration in deep learning.. In CVPR Workshops, Vol. 2. Cited by: §2, §3.3.
  • Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530. Cited by: §1, §2.
  • J. Platt et al. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3), pp. 61–74. Cited by: §2.
  • R. Poplin, A. V. Varadarajan, K. Blumer, Y. Liu, M. V. McConnell, G. S. Corrado, L. Peng, and D. R. Webster (2018) Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2 (3), pp. 158–164. Cited by: §1.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §5.2.
  • Y. Wang, S. Fienberg, and A. Smola (2015) Privacy for free: posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pp. 2493–2502. Cited by: §2.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688. Cited by: §1, §3.2.
  • F. Wenzel, K. Roth, B. S. Veeling, J. Świątkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin (2020) How good is the bayes posterior in deep neural networks really?. arXiv preprint arXiv:2002.02405. Cited by: §3.2.
  • A. G. Wilson (2020) The case for bayesian deep learning. arXiv preprint arXiv:2001.10995. Cited by: §1.