
NanoBatch DPSGD: Exploring Differentially Private Learning on ImageNet with Low Batch Sizes on the IPU

Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, DPSGD imposes computational overheads that can undo the benefit of batching on GPUs. Microbatching is a standard method to alleviate this and is fully supported in the TensorFlow Privacy library (TFDP). However, while this technique improves training times, it also reduces the quality of the gradients and degrades classification accuracy. Recent works that use the JAX framework, for example, show promise in alleviating this as well, but still exhibit a throughput drop from non-private to private SGD on CNNs and have not yet demonstrated ImageNet implementations. In our work, we argue that low batch sizes using group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.





1 Introduction

Differentially private stochastic gradient descent (DPSGD) Abadi et al. (2016) is a technique to train neural networks on sensitive, personal data while providing provable guarantees of privacy. Since Abadi et al. (2016), recent works have been limited to small datasets and networks, in part due to computational challenges. Overcoming these challenges required large mini-batches McMahan et al. (2016). However, differential privacy, and especially larger mini-batches, impacts the privacy loss as well as the accuracy Bagdasaryan et al. (2019). Whereas the original implementation of DPSGD was in TensorFlow Abadi et al. (2016, 2016); McMahan et al. (2018), more recent approaches use JAX with some success Anil et al. (2021); Subramani et al. (2020), and there are several other approaches that tackle acceleration at the framework level Dupuy et al. (2021); Papernot et al. (2018); van der Veen et al. (2018).

This paper explores the application of DPSGD to the ImageNet dataset with a ResNet-50 He et al. (2016) architecture and analyzes the effect of some parameters on accuracy, speed, and privacy on two very different hardware architectures. Recently, Graphcore's Intelligence Processing Unit (IPU) has been introduced. Key properties of the Mk2 IPU are its 1472 processor tiles, 896MiB of on-chip SRAM, 7.8TB/s inter-tile communication, and its MIMD architecture Jia et al. (2019). This allows for fine-grained operations on the chip without excessive communication with the host for fetching weights or instructions. Instead, in most cases, the instructions and intermediate activations reside on-chip. Thus, in numerous applications where other acceleration hardware is challenged, the IPU has shown significant performance advantages, for example in EfficientNet training Masters et al. (2021), approximate Bayesian computation Kulkarni et al. (2020), multi-horizon forecasting Zhang and Zohren (2021), bundle adjustment Ortiz et al. (2020), and particle physics Maddrell-Mander et al. (2021).

This paper focuses on processing ImageNet Russakovsky et al. (2014), which we use as a proxy for large-scale image processing. However, numerous other applications can benefit from this work, such as brain tumor segmentation Li et al. (2019); Sheller et al. (2019), cancer detection Mukherjee et al. (2020), COVID-19 lung scan analysis Lee et al. (2021), and many more in the clinical context Beaulieu-Jones et al. (2018); Geyer et al. (2017), which have grown in both the number of training examples and the size of each example. For example, MRI datasets contain multiple sequences per patient, and each sequence contains volumetric data. Larger and more complex models are therefore needed.

Since its introduction, DPSGD has also seen increased usage in natural language processing Anil et al. (2021); Dupuy et al. (2021); McMahan et al. (2017); Subramani et al. (2020). A common approach is to pretrain on a public dataset without privacy and then finetune on the privacy-sensitive data Luo et al. (2021). This allows larger networks to be trained, but it does not address the challenge of training on big data while still respecting privacy.

An interesting aspect of image processing is the choice of normalization technique. A common approach for ResNet-50 He et al. (2016) is to use batch normalization. However, batch normalization mixes information across different data samples and thus violates privacy. We therefore use group normalization Wu and He (2020) in this paper. Whereas batch norm has an optimal batch size between 32 and 64, group norm enables high accuracy at much lower batch sizes Masters and Luschi (2018). Recently, an alternative method, proxy norm Labatie et al. (2021), has been developed that combines the benefits of batch norm (the best-performing normalization) with those of group norm (which works even at a batch size of one), and it has been shown to speed up EfficientNet significantly Masters et al. (2021).

This paper addresses acceleration at the hardware level and focuses on image processing, which has so far been underrepresented in the DPSGD literature.

2 Experiments

2.1 Implementation

For GPU experiments, we used the public ResNet-50 implementation. We replaced batch norm with group norm to address privacy requirements and used vmap with TFDP for the DPSGD part. For IPU experiments, we used Graphcore's public examples repository for CNNs and added the respective code for clipping and noising to obtain a DPSGD implementation. For simplicity, we obtain larger total batch sizes via gradient accumulation on Mk1 (DSS8440) and Mk2 (IPU-POD16).
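The gradient-accumulation trick mentioned above can be sketched as follows. This is a minimal pure-Python illustration, not code from either repository; the function and parameter names are ours:

```python
def accumulated_update(micro_grads, accumulation_count, lr):
    """Average `accumulation_count` micro-batch gradients before
    applying a single SGD step, emulating a larger total batch
    without holding the whole batch in memory at once.
    `micro_grads` is a list of gradient vectors (lists of floats)."""
    assert len(micro_grads) == accumulation_count
    dim = len(micro_grads[0])
    # Sum the micro-batch gradients coordinate-wise, then average.
    avg = [sum(g[i] for g in micro_grads) / accumulation_count
           for i in range(dim)]
    # Return the weight delta for one SGD step with the averaged gradient.
    return [-lr * a for a in avg]
```

With an accumulation count of k and a micro-batch size of 1, this reproduces the update of a total batch size of k, which is the setting used throughout the experiments below.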

2.2 Differential privacy measurements

In DPSGD, we randomly sample a batch of images at step t, clip each per-example gradient g_i such that ||g_i||_2 <= C, accumulate the clipped gradients over the entire batch, and inject Gaussian noise N(0, sigma^2 C^2 I), where C and sigma are hyperparameters that establish the privacy budget (epsilon, delta). The hyperparameters are usually chosen to maximize classification accuracy (utility) for a given epsilon. In our experiments, we fix the total number of epochs and measure both the accuracy and epsilon. We keep C and sigma constant throughout training and across layers for all experiments. We use the moments accountant implemented in TFDP McMahan et al. (2018) to compute the privacy budget. In both our Mk1 and Mk2 implementations, we clip and inject noise for each gradient example independently to encapsulate DPSGD within an already existing framework, but further throughput gains can be obtained by noising after accumulating. After clipping and noising, gradients are used to update the parameters via SGD without momentum and with a stepped learning rate decay policy.
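The clip-and-noise step described above can be sketched in a few lines. This is a pure-Python sketch for illustration (the real implementations operate on framework tensors; all names here are ours):

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(v, c):
    """Scale v so that its L2 norm is at most c (no-op if already within bound)."""
    n = l2_norm(v)
    scale = min(1.0, c / n) if n > 0 else 1.0
    return [x * scale for x in v]

def dpsgd_step(per_example_grads, c, sigma, rng):
    """One DPSGD update direction: clip each per-example gradient to
    norm c, sum over the batch, add N(0, (sigma*c)^2) noise to each
    coordinate, and average over the batch size."""
    b = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        for i, x in enumerate(clip(g, c)):
            total[i] += x
    return [(t + rng.gauss(0.0, sigma * c)) / b for t in total]
```

Setting sigma to 0 recovers plain gradient clipping; the privacy accounting then maps the chosen (c, sigma, sampling rate, steps) to an (epsilon, delta) budget.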

In our Mk1 experiments, the ResNet-50 model is pipelined and split into 4 stages. We motivate experiments with pipelining as it has become an important technique for building ever larger and more complex models with larger datasets. To ensure little degradation in throughput and to avoid limiting the choice of pipelining scheme, we disallow any stage from communicating gradient signals (including gradient norms) to other stages. This means that to enable DPSGD, we clip the gradients g_s of each stage s independently while ensuring that the original gradient norm bound C is respected; namely, each stage clips to a bound C_s with sqrt(sum_s C_s^2) <= C. In our experiments, we choose a simple uniform partitioning by imposing the same C_s = C/sqrt(4) on each of the 4 stages. This places a tighter constraint on the gradient norm than in the non-pipelined case. The use of pipelining motivates future work into adaptive and layer-wise strategies.
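A sketch of such a uniform per-stage clipping scheme, assuming each of S stages clips to C/sqrt(S) so that the concatenated gradient provably has norm at most C (pure Python, names ours):

```python
import math

def clip_to(v, bound):
    """Scale v so its L2 norm is at most `bound`."""
    n = math.sqrt(sum(x * x for x in v))
    s = min(1.0, bound / n) if n > 0 else 1.0
    return [x * s for x in v]

def clip_per_stage(stage_grads, c):
    """Clip each pipeline stage's gradient independently to c/sqrt(S).
    Then sqrt(sum_s ||g_s||^2) <= sqrt(S * c^2 / S) = c, so the overall
    bound c holds without any cross-stage communication of norms."""
    s = len(stage_grads)
    per_stage_bound = c / math.sqrt(s)
    return [clip_to(g, per_stage_bound) for g in stage_grads]
```

Because every stage may be clipped even when the full gradient's norm is already below C, this bound is strictly tighter than global clipping, which is the trade-off the pipelined experiments pay for avoiding inter-stage communication.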

2.3 Batch size and learning rate analysis

What is the optimal batch size that maximizes throughput? On the A100 GPU, we ran throughput experiments as shown in Table 2. For DPSGD, the optimal total batch size that maximizes throughput for the GPU is 8 independent of the micro-batch size, whereas for SGD, throughput is (usually) maximized for larger batch sizes.

While larger accumulation results in better gradient signal quality, smaller accumulation leads to more incremental, but noisier, weight updates. In Fig. 1, we illustrate this on a small ResNet-8 model on Cifar-10. Gradient accumulation counts as low as 1 (with a batch size of 1) can achieve better convergence than a count of 16. Furthermore, larger batch sizes increase the privacy budget epsilon (computed with TFDP McMahan et al. (2018)) for a fixed number of epochs. The motivation for large batch sizes is to ensure a large gradient signal-to-noise ratio in the hope of improving utility. However, in order to meet the same privacy budget as an equivalent experiment with smaller batch sizes, the number of epochs must be reduced.

Gradient accumulation count | 16   | 8    | 4    | 2    | 1
Val. accuracy (%)           | 65.0 | 64.9 | 66.2 | 66.8 | 67.4
Table 1: Validation accuracy of ResNet-8 after training for 500 epochs on Cifar-10 for varying total batch sizes. The micro-batch size is 1, so the total batch size equals the gradient accumulation count. Group normalization is used.

With this in mind, we performed ImageNet experiments on the Mk1, where we analyzed the interplay between gradient accumulation count, micro-batching, and two schemes of adjusting the learning rate to compensate for the total batch size. In the first case, we kept the initial learning rate at 1.0, the highest value that allowed fast convergence without diverging. In the other case, the initial learning rate is scaled linearly with the change in gradient accumulation count. For all experiments, the learning rate is decayed step-wise after training for 25, 50, 75, and 90 epochs. The accuracy results after 100 epochs of training are displayed in Figure 1. We can see that our learning rate scaling approach slightly improves performance and that increasing the micro-batch size from 1 to 2 clearly decreases performance. Hence, for the following experiments, we use a micro-batch size of 1.
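A stepped schedule of this shape can be written in a few lines. The milestones below match the epochs listed above; the decay factor is a placeholder parameter for illustration, not a value taken from the experiments:

```python
def stepped_lr(epoch, base_lr, decay=0.5, milestones=(25, 50, 75, 90)):
    """Step-wise learning rate decay: multiply the base rate by
    `decay` once for every milestone epoch already passed.
    `decay=0.5` is an illustrative placeholder, not the paper's value."""
    steps_passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** steps_passed
```

Under linear scaling, `base_lr` would additionally be multiplied by the ratio of gradient accumulation counts before the schedule is applied.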

Table 2: Throughput (img/s) comparison between DPSGD and SGD on the A100 with group norm for different micro-batch size (BS) and total batch size combinations.
Total BS | DPSGD (BS=1) | DPSGD (BS=2) | DPSGD (BS=4) | SGD
1        | 24           | -            | -            | 28
8        | 62           | 96           | 133          | 201
16       | 43           | 75           | 121          | 315
32       | 23           | 95           | 82           | 392
Figure 1: Classification accuracy after 100 epochs of training for various gradient accumulation counts. Micro-batching is used for the batch-size-2 experiments.

2.4 Hardware comparison

Total BS | V100 DPSGD | V100 SGD | Mk1 DPSGD | Mk1 SGD | A100 DPSGD | A100 SGD | Mk2 DPSGD | Mk2 SGD
64       | 297        | 796      | 2612      | 2940    | 423        | 999      | 5003      | 6307
128      | 201        | 1261     | 2895      | 3348    | 313        | 1758     | 5474      | 7055
256      | OOM        | 1685     | 3205      | 3482    | 177        | 2510     | 5744      | 7507
512      | OOM        | 2035     | 3227      | 3590    | OOM        | 3046     | 5892      | 7773
Table 3: Throughput (img/s) comparison between different hardware for micro-batch size 1. DPSGD on GPUs uses vmap with TFDP. Left part (8 V100 GPUs, 16 Mk1 IPUs): previous chip generation. Right part (8 A100 GPUs, 16 Mk2 IPUs): latest chip generation.

Multiple publications address the challenge of making differential privacy run fast. Hence, we compare two fundamentally different hardware architectures in this section: GPUs and IPUs. In Section 2.3, we showed that a micro-batch size of 1 delivers the best accuracy. Smaller micro-batch sizes also result in better privacy. Thus, we focus on this setting for the comparison of different hardware and of SGD versus DPSGD. For GPUs, the total batch size is the product of the single-device batch size and the number of replicas. For the IPU, we use a local batch size of 1 and then use gradient accumulation and replicas for the respective larger total batch sizes. We compare machines with 8 GPUs to 16 IPUs to match the systems' TDP (Watts) and standard packaging.

The results are displayed in Table 3. On GPUs, DPSGD substantially reduces throughput relative to SGD; on IPUs, the reduction is much smaller on Mk1 and moderate on Mk2. All compared hardware additionally takes a performance hit because DPSGD's memory requirements rule out the much higher batch sizes possible with SGD. Given that the Mk2 and A100 are the successors of the Mk1 and V100 and that each pair shares the same lithographic node (TSMC 7N for Mk2/A100 and 12N for Mk1/V100), it is natural to compare those pairs. All A100 experiments were performed on a Google Cloud Platform a2-highgpu-8g instance with 96 vCPUs and 680 GB memory, and the V100 experiments on a Thinkmate workstation with dual Intel Xeon Gold 6248 CPUs, 755 GB memory, and 8 V100s. In this setting, GPUs are 8-11 times slower than IPUs.

This means that compared to the 6 hour training run needed to obtain our result on an Mk2 IPU-POD16 system, it would take at least two days to obtain the same results with an A100 GPU system.

2.5 Differential privacy result

A summary of the DPSGD ImageNet experiments is shown in Table 4, with and without pipelining. Without pipelining, we achieve 71.2% accuracy with epsilon = 11.4. These values are in the expected range, given the maximum possible non-private accuracy and the epsilons commonly observed in other applications. With pipelining, due to the tighter constraint of pipelined clipping, we achieve a slightly worse accuracy-epsilon trade-off than in the non-pipelined case.

Model     | tbs | bs | Hardware    | Pipeline | epsilon | Acc.  | Duration
ResNet-50 | 512 | 1  | Mk2 16xIPUs | no       | 11.4    | 71.2% | 6
ResNet-50 | 64  | 1  | Mk1 16xIPUs | yes      | 14.2    | 69.6% | 13.6
ResNet-50 | 128 | 2  | Mk1 16xIPUs | yes      | 18.4    | 65.2% | 8.4
Table 4: Differential privacy results (final epsilon and accuracy at epoch 100) on ImageNet for different total batch size (tbs), micro-batch size (bs), and hardware configurations. Duration is the time in hours to train for 100 epochs.

3 Conclusion

In this paper, we showed that training with DPSGD on a large dataset like ImageNet is not only feasible with a per-device batch size of 1, but that smaller batch sizes are preferred. We showed that we can train ResNet-50 on ImageNet for 100 epochs in 6 hours on Graphcore's Mk2 IPU-POD16, whereas comparable hardware takes 10 times longer.

In the future, we would like to establish an ImageNet benchmark for DPSGD and explore other optimizers, learning rate schedules, and proxy norm to obtain faster convergence and thus better privacy guarantees. Further acceleration improvements are also of interest, such as reimplementing and running the experiments with the JAX framework. From the application point of view, we want to transfer these findings to federated learning and the analysis of sensitive data, such as COVID-19 lung scans Lee et al. (2021).


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, USA, pp. 265–283. External Links: ISBN 9781931971331 Cited by: §1.
  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep Learning with Differential Privacy. Proceedings of the ACM Conference on Computer and Communications Security 24-28-Octo, pp. 308–318. External Links: Document, 1607.00133, ISBN 9781450341394, ISSN 15437221, Link Cited by: §1.
  • R. Anil, B. Ghazi, V. Gupta, R. Kumar, and P. Manurangsi (2021) Large-Scale Differentially Private BERT. arXiv. External Links: 2108.01624, Link Cited by: §1, §1.
  • E. Bagdasaryan, O. Poursaeed, and V. Shmatikov (2019) Differential Privacy Has Disparate Impact on Model Accuracy. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1.
  • B. K. Beaulieu-Jones, W. Yuan, S. G. Finlayson, and Z. S. Wu (2018) Privacy-Preserving Distributed Deep Learning for Clinical Data. arXiv. External Links: 1812.01484, Link Cited by: §1.
  • C. Dupuy, R. Arava, R. Gupta, and A. Rumshisky (2021) An Efficient DP-SGD Mechanism for Large Scale NLP Models. arXiv. External Links: 2107.14586, Link Cited by: §1, §1.
  • R. C. Geyer, T. Klein, and M. Nabi (2017) Differentially Private Federated Learning: A Client Level Perspective. NIPS 2017 Workshop: Machine Learning on the Phone and other Consumer Devices. External Links: 1712.07557, Link Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2016-Decem, pp. 770–778. External Links: Document, 1512.03385, ISBN 9781467388504, ISSN 10636919 Cited by: §1, §1.
  • Z. Jia, B. Tillman, M. Maggioni, and D. P. Scarpazza (2019) Dissecting the Graphcore IPU architecture via microbenchmarking. ArXiv abs/1912.03413. Cited by: §1.
  • S. Kulkarni, M. M. Krell, S. Nabarro, and C. A. Moritz (2020) Hardware-accelerated Simulation-based Inference of Stochastic Epidemiology Models for COVID-19. ACM Journal on Emerging Technologies in Computing Systems. External Links: 2012.14332, Link Cited by: §1.
  • A. Labatie, D. Masters, Z. Eaton-Rosen, and C. Luschi (2021) Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. arXiv. External Links: 2106.03743, Link Cited by: §1.
  • E. H. Lee, J. Zheng, E. Colak, M. Mohammadzadeh, G. Houshmand, N. Bevins, F. Kitamura, E. Altinmakas, E. P. Reis, J. Kim, C. Klochko, M. Han, S. Moradian, A. Mohammadzadeh, H. Sharifian, H. Hashemi, K. Firouznia, H. Ghanaati, M. Gity, H. Doğan, H. Salehinejad, H. Alves, J. Seekins, N. Abdala, C. Atasoy, H. Pouraliakbar, M. Maleki, S. W. S, and K. W. Yeom (2021) Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT. npj Digital Medicine 4 (1), pp. 11. External Links: Document Cited by: §1, §3.
  • W. Li, F. Milletarì, D. Xu, N. Rieke, J. Hancox, W. Zhu, M. Baust, Y. Cheng, S. Ourselin, M. J. Cardoso, and A. Feng (2019) Privacy-Preserving Federated Brain Tumour Segmentation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11861 LNCS, pp. 133–141. External Links: Document, 1910.00962, ISBN 9783030326913, ISSN 16113349 Cited by: §1.
  • Z. Luo, D. J. Wu, E. Adeli, and L. Fei-Fei (2021) Scalable differential privacy with sparse network finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5059–5068. Cited by: §1.
  • S. Maddrell-Mander, L. R. M. Mohan, A. Marshall, D. O’Hanlon, K. Petridis, J. Rademacker, V. Rege, and A. Titterton (2021) Studying the Potential of Graphcore® IPUs for Applications in Particle Physics. Computing and Software for Big Science 5 (1). External Links: Document, 2008.09210, ISSN 25102044 Cited by: §1.
  • D. Masters, A. Labatie, Z. Eaton-Rosen, and C. Luschi (2021) Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training. arXiv. External Links: 2106.03640, Link Cited by: §1, §1.
  • D. Masters and C. Luschi (2018) Revisiting Small Batch Training for Deep Neural Networks. arXiv. External Links: 1804.07612, Link Cited by: §1.
  • H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz (2018) A General Approach to Adding Differential Privacy to Iterative Training Procedures. In NeurIPS 2018 workshop on Privacy Preserving Machine Learning, External Links: 1812.06210, Link Cited by: §1, §2.2, §2.3.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas (2016) Communication-Efficient Learning of Deep Networks from Decentralized Data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017. External Links: 1602.05629, Link Cited by: §1.
  • H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang (2017) Learning Differentially Private Recurrent Language Models. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. External Links: 1710.06963, Link Cited by: §1.
  • P. Mukherjee, M. Zhou, E. Lee, A. Schicht, Y. Balagurunathan, S. Napel, R. Gillies, S. Wong, A. Thieme, A. Leung, and O. Gevaert (2020) A shallow convolutional neural network predicts prognosis of lung cancer patients in multi-institutional computed tomography image datasets. Nature Machine Intelligence 2 (5), pp. 274–282. External Links: Document Cited by: §1.
  • J. Ortiz, M. Pupilli, S. Leutenegger, and A. J. Davison (2020) Bundle Adjustment on a Graph Processor. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2416–2425. External Links: Link Cited by: §1.
  • N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson (2018) Scalable Private Learning with PATE. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. External Links: 1802.08908, Link Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2014) ImageNet Large Scale Visual Recognition Challenge. External Links: 1409.0575, Link Cited by: §1.
  • M. J. Sheller, G. A. Reina, B. Edwards, J. Martin, and S. Bakas (2019) Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11383 LNCS, pp. 92–104. External Links: Document, 1810.04304, ISBN 9783030117221, ISSN 16113349 Cited by: §1.
  • P. Subramani, N. Vadivelu, and G. Kamath (2020) Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization. arXiv. External Links: 2010.09063, Link Cited by: §1, §1.
  • K. L. van der Veen, R. Seggers, P. Bloem, and G. Patrini (2018) Three Tools for Practical Differential Privacy. arXiv. External Links: 1812.02890, Link Cited by: §1.
  • Y. Wu and K. He (2020) Group Normalization. International Journal of Computer Vision 128 (3), pp. 742–755. External Links: Document, 1803.08494, ISSN 0920-5691, Link Cited by: §1.
  • Z. Zhang and S. Zohren (2021) Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units. arXiv. External Links: 2105.10430, Link Cited by: §1.