Differentially private stochastic gradient descent (DPSGD) Abadi et al. (2016) is a technique to train neural networks on sensitive, personal data while providing provable guarantees of privacy. Since Abadi et al. (2016), work in this area has largely been limited to small datasets and networks, in part due to computational challenges. Overcoming these challenges required large mini-batches McMahan et al. (2016). However, differential privacy, and larger mini-batches in particular, affect both the privacy loss and the accuracy Bagdasaryan et al. (2019). Whereas the original implementation of DP-SGD was in TensorFlow Abadi et al. (2016); McMahan et al. (2018), more recent approaches use JAX with some success Anil et al. (2021); Subramani et al. (2020), and there are several other approaches that tackle acceleration at the framework level Dupuy et al. (2021); Papernot et al. (2018); van der Veen et al. (2018).
This paper explores the application of DPSGD to the ImageNet dataset with a ResNet-50 He et al. (2016) architecture and analyzes the effect of some parameters on accuracy, speed, and privacy on two very different hardware architectures. Recently, Graphcore’s Intelligence Processing Unit (IPU) has been introduced. Key properties of the Mk2 IPU are that it has 1472 processor tiles, 896 MiB of on-chip SRAM, 7.8 TB/s inter-tile communication bandwidth, and a MIMD architecture Jia et al. (2019). This allows for fine-grained operations on the chip without excessive communication to the host for fetching weights or instructions; in most cases, the instructions and intermediate activations reside on-chip. Thus, in numerous applications where other acceleration hardware is challenged, the IPU has shown significant performance advantages, such as EfficientNet Masters et al. (2021), approximate Bayesian computation Kulkarni et al. (2020), multi-horizon forecasting Zhang and Zohren (2021), bundle adjustment Ortiz et al. (2020), and particle physics Maddrell-Mander et al. (2021).
This paper focuses on processing ImageNet Russakovsky et al. (2014), which we use as a proxy for large-scale image processing. However, numerous other applications can benefit from this work, such as brain tumor segmentation Li et al. (2019); Sheller et al. (2019), cancer detection Mukherjee et al. (2020), COVID-19 lung scan analysis Lee et al. (2021), and many more in a clinical context Beaulieu-Jones et al. (2018); Geyer et al. (2017), where datasets have grown both in the number of training examples and in the size of each example. For instance, MRI datasets contain multiple sequences per patient, and each sequence contains volumetric data. Larger and more complex models are therefore needed.
Since its introduction, DPSGD has also seen increased usage in natural language processing Anil et al. (2021); Dupuy et al. (2021); McMahan et al. (2017); Subramani et al. (2020). A common approach is to pretrain on a public dataset without privacy and then fine-tune on the privacy-sensitive data Luo et al. (2021). Larger networks can thus be trained, but the challenge of training on big data while still respecting privacy is not addressed.
An interesting aspect of image processing is normalization. A common approach for ResNet-50 He et al. (2016) is to use batch normalization. However, batch normalization mixes information across different data samples and thus violates privacy. We therefore use group norm Wu and He (2020) in this paper. Whereas batch norm has an optimal batch size between 32 and 64, group norm enables high accuracy at much smaller batch sizes Masters and Luschi (2018). Recently, an alternative method, proxy norm Labatie et al. (2021), has been developed. It combines the benefits of batch norm (the best-performing normalization) with those of group norm (which works even at a batch size of one) and has been shown to speed up EfficientNet significantly Masters et al. (2021).
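To make concrete why group norm is compatible with per-example privacy, the following sketch (plain Python; the helper name is illustrative) normalizes a single example's channels within groups, so no statistics are shared across samples:

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one example's channel activations within channel groups.

    x: flat list of per-channel activations for a SINGLE example.
    Unlike batch norm, the mean/variance are computed per example,
    so no information is mixed across samples.
    """
    group_size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        chunk = x[g * group_size:(g + 1) * group_size]
        mean = sum(chunk) / group_size
        var = sum((v - mean) ** 2 for v in chunk) / group_size
        out.extend((v - mean) / math.sqrt(var + eps) for v in chunk)
    return out
```

Learned scale and shift parameters (omitted here for brevity) are applied after this normalization in a full implementation.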
This paper addresses acceleration at the hardware level and image processing, which have so far been underrepresented in the literature.
For the GPU experiments, we used the public ResNet-50 implementation, replaced batch norm with group norm to meet the privacy requirements, and used vmap with TFDP for the DPSGD part. For the IPU experiments, we used Graphcore’s public examples repository for CNNs and added the respective clipping and noising code to obtain a DPSGD implementation. For simplicity, we obtain larger total batch sizes via gradient accumulation on Mk1 (DSS8440) and Mk2 (IPU-POD16).
2.2 Differential privacy measurements
In DPSGD, we randomly sample a batch of images at step $t$, clip each per-example gradient $g_i$ such that $\|g_i\|_2 \le C$, accumulate the clipped gradients over the entire batch, and inject noise drawn from $\mathcal{N}(0, \sigma^2 C^2 \mathbf{I})$, where the clipping bound $C$ and the noise multiplier $\sigma$ are hyperparameters that establish the privacy budget. The hyperparameters are usually chosen to maximize classification accuracy (utility) for a given privacy budget $\varepsilon$. In our experiments, we fix the total number of epochs and measure both the accuracy and $\varepsilon$. We keep $C$ and $\sigma$ constant throughout training and across layers for all experiments. We use the moments accountant implemented in TFDP McMahan et al. (2018) to compute the privacy budget. In both our Mk1 and Mk2 implementations, we clip and inject noise for each gradient example independently to encapsulate DPSGD within an already existing framework; further throughput gains could be obtained by noising only after accumulating. After clipping and noising, the gradients are used to update the parameters via SGD without momentum and with a stepped learning-rate decay policy.
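The clip-then-noise step described above can be sketched as follows (plain Python; the function name is illustrative, and Gaussian noise with standard deviation $\sigma C$ on the summed gradient follows the standard DPSGD formulation rather than our exact code):

```python
import math
import random

def dpsgd_update(per_example_grads, C, sigma, seed=0):
    """One DPSGD gradient step (sketch).

    Clip each example's gradient to L2 norm at most C, sum the clipped
    gradients, add Gaussian noise with std sigma * C per coordinate,
    then average over the batch.
    """
    rng = random.Random(seed)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        # Scale down only if the norm exceeds the clipping bound C.
        scale = min(1.0, C / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += g[j] * scale
    noisy = [t + rng.gauss(0.0, sigma * C) for t in total]
    return [v / len(per_example_grads) for v in noisy]
```

The resulting averaged gradient is then fed to plain SGD without momentum, as described above.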
In our Mk1 experiments, the ResNet-50 model is pipelined and split into 4 stages. We motivate experiments with pipelining because it has become an important ingredient for building ever larger and more complex models on larger datasets. To ensure little degradation in throughput and to prevent limitations on pipelining schemes, we disallow any stage from communicating gradient signals (including the gradient norm) to other stages. This means that, to enable DPSGD, we clip the per-stage gradients $g^{(s)}$ of each stage $s$ independently while ensuring that the original gradient norm bound is respected. Namely, $\sum_{s=1}^{S} (C^{(s)})^2 \le C^2$, where $C^{(s)}$ is the clipping bound of stage $s$. In our experiments, we choose a simple uniform partitioning by imposing $C^{(s)} = C/\sqrt{S}$ for each stage. This places a tighter constraint on the gradient norm than in the non-pipelined case. The use of pipelining motivates future work into adaptive and layer-wise clipping strategies.
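The uniform per-stage clipping scheme can be sketched as follows (a minimal sketch assuming the uniform $C/\sqrt{S}$ partition; names are illustrative). Clipping each stage's slice to $C/\sqrt{S}$ guarantees the concatenated gradient norm stays within $C$, with no cross-stage communication:

```python
import math

def clip_pipelined(stage_grads, C):
    """Clip each pipeline stage's gradient slice independently.

    Each of the S stages clips its slice to C / sqrt(S), so the norm of
    the concatenated gradient is bounded by C without any stage needing
    to see another stage's gradient norm.
    """
    S = len(stage_grads)
    C_s = C / math.sqrt(S)
    clipped = []
    for g in stage_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, C_s / norm) if norm > 0 else 1.0
        clipped.append([v * scale for v in g])
    return clipped
```

Because every stage may be scaled down even when the global norm is below $C$, this is strictly more conservative than global clipping, which is the trade-off noted above.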
2.3 Batch size and learning rate analysis
What is the optimal batch size that maximizes throughput? On the A100 GPU, we ran the throughput experiments shown in Table 2. For DPSGD, the total batch size that maximizes GPU throughput is 8, independent of the micro-batch size, whereas for SGD, throughput is (usually) maximized at larger batch sizes.
While larger accumulation results in better gradient signal quality, smaller accumulation leads to more incremental, but noisier, weight updates. In Fig. 1, we illustrate this on a small ResNet-8 model on CIFAR-10. Gradient accumulation counts down to 1 (with a batch size of 1) can achieve better convergence than a count of 16. Furthermore, larger batch sizes increase the privacy budget $\varepsilon$ (TFDP McMahan et al. (2018)) for a fixed number of epochs. The motivation for large batch sizes is to ensure a large gradient signal-to-noise ratio in the hope of improving utility. However, in order to meet the same privacy budget as an equivalent experiment with smaller batch sizes, the number of epochs must be reduced.
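Two quantities drive this trade-off in the moments-accountant analysis: the per-step sampling rate and the total number of steps. A minimal sketch (an illustrative helper, not the TFDP API) of how batch size affects both for a fixed number of epochs:

```python
def dp_sgd_schedule(dataset_size, batch_size, epochs):
    """Quantities that feed the privacy accountant (sketch).

    q:     fraction of the dataset sampled per step; larger batches raise q,
           which tends to raise epsilon for a fixed number of epochs.
    steps: number of noisy gradient updates over the whole run.
    """
    q = batch_size / dataset_size
    steps = epochs * dataset_size // batch_size
    return q, steps
```

For example, doubling the batch size doubles $q$ while halving the step count; the accountant's composition over these values determines the final $\varepsilon$.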
| Gradient accumulation count | 16 | 8 | 4 | 2 | 1 |
| --- | --- | --- | --- | --- | --- |
| Val. accuracy (%) | 65.0 | 64.9 | 66.2 | 66.8 | 67.4 |
With this in mind, we performed ImageNet experiments on the Mk1, where we analyzed the interplay between gradient accumulation count, micro-batching, and two schemes of adjusting the learning rate to compensate for the total batch size. In the first case, we kept the initial learning rate fixed at 1.0, the highest value that allowed fast convergence without diverging. In the other case, the initial learning rate is scaled linearly with the change in gradient accumulation count. For all experiments, the learning rate is decayed in a step-wise manner after training for 25, 50, 75, and 90 epochs. The accuracy results after 100 epochs of training are displayed in Figure 1. We can see that our learning rate scaling approach slightly improves performance and that increasing the micro-batch size from 1 to 2 clearly decreases performance. Hence, for the following experiments, we use a micro-batch size of 1.
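The linear learning-rate scaling with stepped decay can be sketched as follows (the decay factor of 0.1 is an assumed placeholder, as the exact factor is not stated here; the function and parameter names are illustrative):

```python
def learning_rate(epoch, accumulation_count, base_lr=1.0, ref_accum=1,
                  decay=0.1, milestones=(25, 50, 75, 90)):
    """Linear-scaling schedule sketch.

    Scale the initial learning rate linearly with the gradient accumulation
    count relative to a reference count, then apply a stepped decay after
    each milestone epoch. decay=0.1 is an ASSUMED factor for illustration.
    """
    lr = base_lr * accumulation_count / ref_accum  # linear scaling rule
    for m in milestones:
        if epoch >= m:
            lr *= decay  # stepped decay at 25, 50, 75, 90 epochs
    return lr
```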
2.4 Hardware comparison
| Total | 8 V100 GPUs | 16 Mk1 IPUs | 8 A100 GPUs | 16 Mk2 IPUs |
Multiple publications address the challenge of getting differential privacy to run fast. Hence, we compare two fundamentally different hardware architectures in this section: GPUs and IPUs. In Section 2.3, we showed that a batch size of 1 delivers the best accuracy; smaller micro-batch sizes also result in better privacy. Thus, we focus on this setting for the comparison of different hardware and of SGD versus DPSGD. For GPUs, the total batch size is the product of the single-device batch size and the number of replicas. For the IPU, we use a local batch size of 1 and then use gradient accumulation and replicas for larger total batch sizes. We compare machines with 8 GPUs to machines with 16 IPUs to match the systems' TDP (Watts) and standard packaging.
The results are displayed in Table 3. DPSGD reduces performance on all platforms, but the reduction is considerably larger on GPUs than on the Mk1 and Mk2 IPUs. All compared hardware takes an additional performance hit because DPSGD's memory constraints prevent the much larger batch sizes that plain SGD can use. Given that the Mk2 and A100 are the successors of the Mk1 and V100 and share the same lithographic node generation (TSMC 7N for Mk2/A100 and 12N for Mk1/V100), it is natural to compare those pairs. All A100 experiments were performed on a Google Cloud Platform a2-highgpu-8g instance with 96 vCPUs and 680 GB memory, and the V100 experiments on a Thinkmate workstation with dual Intel Xeon Gold 6248 CPUs, 755 GB memory, and 8 V100s. In this setting, GPUs are 8-11 times slower than IPUs.
This means that, compared to the training time needed to obtain these results on an Mk2 IPU-POD16 system, reaching the same results with A100 GPUs would take roughly an order of magnitude longer.
2.5 Differential privacy result
A summary of the DPSGD ImageNet experiments, with and without pipelining, is shown in Table 4. Without pipelining, we achieve 71% accuracy, with privacy parameters in the expected range given the maximum possible accuracy and the epsilons commonly observed in other applications. With pipelining, due to the tighter constraint of per-stage clipping, we achieve a slightly worse accuracy-$\varepsilon$ trade-off than in the non-pipelined case.
In this paper, we showed that training with DPSGD on a large dataset like ImageNet is not only feasible with a per-device batch size of 1, but that smaller batch sizes are in fact preferable. We show that we can train ResNet-50 on ImageNet for 100 epochs in a matter of hours on Graphcore's Mk2 IPU-POD16, whereas comparable hardware takes roughly 10 times longer.
In future work, we would like to establish an ImageNet benchmark for DPSGD and explore other optimizers, learning rate schedules, and proxy norm to obtain faster convergence and thus better privacy guarantees. Further acceleration is also of interest, such as reimplementing and rerunning the experiments in the JAX framework. From the application point of view, we want to transfer these findings to federated learning and to the analysis of sensitive data such as lung scans for COVID-19 Lee et al. (2021).
- Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16), pp. 265–283.
- Abadi et al. (2016). Deep Learning with Differential Privacy. In Proceedings of the ACM Conference on Computer and Communications Security (CCS 2016), pp. 308–318.
- Anil et al. (2021). Large-Scale Differentially Private BERT. arXiv.
- Bagdasaryan et al. (2019). Differential Privacy Has Disparate Impact on Model Accuracy. In Advances in Neural Information Processing Systems, Vol. 32.
- Beaulieu-Jones et al. (2018). Privacy-Preserving Distributed Deep Learning for Clinical Data. arXiv.
- Dupuy et al. (2021). An Efficient DP-SGD Mechanism for Large Scale NLP Models. arXiv.
- Geyer et al. (2017). Differentially Private Federated Learning: A Client Level Perspective. NIPS 2017 Workshop: Machine Learning on the Phone and other Consumer Devices.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Jia et al. (2019). Dissecting the Graphcore IPU architecture via microbenchmarking. arXiv abs/1912.03413.
- Kulkarni et al. (2020). Hardware-accelerated Simulation-based Inference of Stochastic Epidemiology Models for COVID-19. ACM Journal on Emerging Technologies in Computing Systems.
- Labatie et al. (2021). Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. arXiv.
- Lee et al. (2021). Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT. npj Digital Medicine 4 (1), pp. 11.
- Li et al. (2019). Privacy-Preserving Federated Brain Tumour Segmentation. Lecture Notes in Computer Science, Vol. 11861, pp. 133–141.
- Luo et al. (2021). Scalable differential privacy with sparse network finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5059–5068.
- Maddrell-Mander et al. (2021). Studying the Potential of Graphcore IPUs for Applications in Particle Physics. Computing and Software for Big Science 5 (1).
- Masters et al. (2021). Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training. arXiv.
- Masters and Luschi (2018). Revisiting Small Batch Training for Deep Neural Networks. arXiv.
- McMahan et al. (2018). A General Approach to Adding Differential Privacy to Iterative Training Procedures. In NeurIPS 2018 Workshop on Privacy Preserving Machine Learning.
- McMahan et al. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
- McMahan et al. (2018). Learning Differentially Private Recurrent Language Models. In 6th International Conference on Learning Representations (ICLR 2018).
- Mukherjee et al. (2020). A shallow convolutional neural network predicts prognosis of lung cancer patients in multi-institutional computed tomography image datasets. Nature Machine Intelligence 2 (5), pp. 274–282.
- Ortiz et al. (2020). Bundle Adjustment on a Graph Processor. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2416–2425.
- Papernot et al. (2018). Scalable Private Learning with PATE. In 6th International Conference on Learning Representations (ICLR 2018).
- Russakovsky et al. (2014). ImageNet Large Scale Visual Recognition Challenge.
- Sheller et al. (2019). Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. Lecture Notes in Computer Science, Vol. 11383, pp. 92–104.
- Subramani et al. (2020). Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization. arXiv.
- van der Veen et al. (2018). Three Tools for Practical Differential Privacy. arXiv.
- Wu and He (2020). Group Normalization. International Journal of Computer Vision 128 (3), pp. 742–755.
- Zhang and Zohren (2021). Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units. arXiv.