Towards Quantifying the Carbon Emissions of Differentially Private Machine Learning

07/14/2021 ∙ by Rakshit Naidu, et al.

In recent years, machine learning techniques utilizing large-scale datasets have achieved remarkable performance. Differential privacy, by means of adding noise, provides strong privacy guarantees for such learning algorithms. The cost of differential privacy is often a reduced model accuracy and a lowered convergence speed. This paper investigates the impact of differential privacy on learning algorithms in terms of their carbon footprint due to either longer run-times or failed experiments. Through extensive experiments, further guidance is provided on choosing the noise levels which can strike a balance between desired privacy levels and reduced carbon emissions.

1 Introduction

With the rising availability of large-scale, diverse datasets, the performance of Machine Learning (ML) models has experienced a significant boost across a multitude of domains. This boost is also associated with the availability of extreme-scale datasets, which is heavily linked to individual user contributions achieved via crowd-sourcing. ML algorithms often perform operations directly on raw user data, leading to a host of privacy violations. Differential Privacy (DP) (Dwork and Roth, 2014; Abadi et al., 2016) makes progress in this domain by providing strong privacy guarantees for such contributing individuals. This guarantee is achieved by means of noise addition, which can be done at various stages of the ML pipeline, including: (1) Local DP: Addition to the raw data (Cormode et al., 2018) (2) Gradient DP: Addition to gradients after clipping (Abadi et al., 2016) (3) Output & Objective DP: Addition to the final ML model or the loss function (Chaudhuri et al., 2011).

1.1 Impact on Climate Change

It is well-known that the computational resource investment requisite for training ML models generates a carbon footprint. This footprint is amplified in privacy-preserving setups where it is harder to reach consistent accuracy due to the addition of noise. Extended and failed runs (especially on larger datasets) actively contribute to an increase in the carbon footprint of ML experiments (Strubell et al., 2019). Therefore, an analysis of the climatic impact of this privacy modulation is critical. While the existing DP literature studies several performance aspects affected by varying privacy requirements, it lacks a comprehensive quantification of the carbon footprint of DP and how it is affected by variable privacy levels. Since DP also provides a mathematical paradigm to quantify the privacy budget of training ML models while tracking the privacy usage across multiple runs, this paper aims at quantifying the Carbon Emissions (CE) associated with varying privacy budgets of differentially private networks. In order to study the impact of DP on these emissions, we implement Gradient DP (DP-SGD (Abadi et al., 2016)) for natural language processing, image classification, and reinforcement learning domains to identify the privacy implications, model performance and, most crucially, the carbon footprint of each algorithm. To the best of our knowledge, this is the first attempt to quantitatively benchmark the carbon footprint of differentially private ML models.

1.2 Differential Privacy

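Definition 1: Given a randomized mechanism $\mathcal{M}$ (with domain $\mathcal{D}$ and range $\mathcal{R}$) and any two neighboring datasets $D, D' \in \mathcal{D}$ (i.e. they differ by a single individual data element), $\mathcal{M}$ is said to be $(\epsilon, \delta)$-differentially private if, for any subset $S \subseteq \mathcal{R}$,

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta \qquad (1)$$

(Footnote 1: In this work, we exclusively use Gaussian noise.)

Here, $\epsilon \geq 0$ and $\delta \geq 0$. A $\delta = 0$ case corresponds to pure differential privacy, while both $\epsilon = 0$ and $\delta = 0$ lead to an infinitely high privacy domain. Finally, $\epsilon = \infty$ provides no privacy guarantees. For practical purposes we want $\delta \ll 1/N$, where $N$ is the number of samples in the dataset (Dwork and Roth, 2014).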
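As a rough illustration of how $\epsilon$ and $\delta$ translate into a noise scale, the sketch below uses the classical Gaussian-mechanism bound from Dwork and Roth (2014), $\sigma \geq \sqrt{2 \ln(1.25/\delta)} \, \Delta_2 / \epsilon$, which holds for $0 < \epsilon < 1$. The function name `gaussian_sigma` and the value $\delta = 10^{-5}$ are assumptions made for this example. The experiments in this paper rely on DP-SGD, whose privacy accounting differs from this single-query bound.

```python
import math


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Threshold noise std for the classical Gaussian mechanism.

    Uses sigma = sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon, which is
    only valid for 0 < epsilon < 1. `sensitivity` is the L2 sensitivity of the
    query being released.
    """
    if not (0.0 < epsilon < 1.0):
        raise ValueError("the classical bound assumes 0 < epsilon < 1")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


# Halving epsilon (stronger privacy) doubles the required noise scale.
sigma_weak = gaussian_sigma(0.5, 1e-5)
sigma_strong = gaussian_sigma(0.25, 1e-5)
print(f"sigma at eps=0.5:  {sigma_weak:.2f}")
print(f"sigma at eps=0.25: {sigma_strong:.2f}")
```

The inverse dependence on $\epsilon$ is the point of the sketch: tightening the privacy budget increases the noise scale in direct proportion.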

The privacy of differentially private models can be quantified with parameters such as epsilon ($\epsilon$) and delta ($\delta$). Utilizing DP-SGD (Abadi et al., 2016), that is, adding noise to the gradients at each step during training using a clipping factor ($C$) and noise multiplier ($\sigma$), the amount of noise added to the model can be linked to the degree of privacy that the model can achieve. Theoretically, a lower value of $\epsilon$ indicates a higher degree of privacy, and this increased privacy is, understandably, achieved at the expense of model performance due to the addition of the noise. The practical implication of this, however, includes a direct impact on the computational resources required to achieve model performance. Reduced privacy requirements allow the addition of noise with limited power, and hence, models can achieve appropriate performance without any significant resource expense. On the other hand, high privacy requirements necessitate adding noise of significantly larger magnitude, which may directly lead to an increase in the number of training passes that the model has to iterate over to achieve the same accuracy. Further, noise addition may even lead to the non-convergence of some systems in the worst case.
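The gradient step described above can be sketched in a few lines. This is a minimal pure-Python illustration of the DP-SGD update from Abadi et al. (2016): clip each per-sample gradient to an L2 norm of at most the clipping factor, sum, add Gaussian noise with standard deviation equal to the noise multiplier times the clipping factor, and average. The function name `dp_sgd_step` and its argument names are hypothetical, and the sketch omits privacy accounting entirely.

```python
import math
import random


def dp_sgd_step(params, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    """Apply one DP-SGD update and return the new parameter list."""
    batch_size = len(per_sample_grads)
    summed = [0.0] * len(params)
    for grad in per_sample_grads:
        # Rescale the gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Gaussian noise is added to the clipped sum, then the result is averaged.
    noise_std = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, noise_std)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.0]]
print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

With the noise multiplier set to zero the update reduces to plain SGD on clipped gradients, which makes the clipping and the noise easy to inspect separately.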

1.3 Related Work

Works such as (Strubell et al., 2019; Toews, 2020) discuss how conventional Machine Learning models impact carbon footprint. In particular, (Strubell et al., 2019) discusses how training a single Deep Learning model generates the total lifetime carbon footprint of nearly five cars (as mentioned in (Toews, 2020)), which is more than 17 times the amount of CO2 emissions generated by an average American per year. Regarding DP, there has been very little consideration of how Privacy-Preserving Machine Learning (PPML) impacts climate change. In (Qiu et al., 2021), a comprehensive study is presented on how local client-side models in Federated Learning (FL) could potentially hold quality data required to understand climate change given data privacy concerns due to recent policies like GDPR (Skendžić et al., 2018). However, running local models on multiple client devices and aggregating them globally at the server level requires additional infrastructure in place, thereby causing a detrimental effect on carbon emissions.

1.4 Contributions and Impacts

In this paper, we provide the first benchmark to quantitatively assess how DP noise affects carbon emissions in three different tasks: (1) a Natural Language Processing (NLP) task using news classification, (2) a Computer Vision (CV) task using the MNIST dataset, and (3) a Reinforcement Learning (RL) task using the Cartpole control problem. Intuitively, when DP noise is added to ML pipelines, the carbon emissions should increase, as the energy required for computation increases with the rising number of epochs required for convergence. In order to quantify how the addition of noise plays into climate change, we track carbon emissions in the models using the codecarbon tool (Schmidt et al., 2021), a joint effort from the authors of (Lacoste et al., 2019) and (Lottick et al., 2019). We record the average accuracy of several runs of the considered ML task to assess the behavior of DP noise.
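A tracking setup along these lines might look like the sketch below. It is not the authors' code: the helper `run_with_tracking` is a hypothetical wrapper, and it assumes the `codecarbon` package exposes `EmissionsTracker` with `start()` and `stop()` methods, where `stop()` reports emissions in kg of CO2-equivalent. If the package is not installed, the helper simply runs the task and reports no emissions.

```python
def run_with_tracking(task):
    """Run task() and return (result, emissions_kg).

    emissions_kg is None when codecarbon is unavailable.
    """
    try:
        from codecarbon import EmissionsTracker
    except ImportError:
        # Fallback so the sketch still runs without the optional dependency.
        return task(), None

    # save_to_file=False avoids writing an emissions CSV next to the script.
    tracker = EmissionsTracker(save_to_file=False, log_level="error")
    tracker.start()
    try:
        result = task()
    finally:
        emissions_kg = tracker.stop()
    return result, emissions_kg


# Hypothetical stand-in for a training run.
result, emissions_kg = run_with_tracking(lambda: sum(i * i for i in range(100_000)))
print(result, emissions_kg)
```

Wrapping the whole training run in a single tracker keeps failed or extended runs inside the measurement, which matters for the comparison made in this paper.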

Noise for masking data has been widely used in adversarial machine learning (Kurakin et al., 2016). Given the rise in Privacy-enhancing Technologies and privacy policies, noise addition has now become prevalent in the context of DP. We envisage this work providing insight into how different amounts of noise result in varying amounts of CO2 emissions. Hence, our work takes a first look at how the addition of noise could impact a number of industries, from healthcare to finance and justice, where sensitive data is commonly in use.

2 Experimental Results

2.1 BERT

In this set of experiments, we run two experiments on Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al., 2019). The model is fine-tuned for topic classification of news articles. The primary objective of these experiments is to observe the carbon emissions and energy usage of vanilla BERT and DP-BERT (over different privacy levels).

A randomly down-sampled subset (15,000 samples) of the AG News Classification dataset (Anand, 2020) is used for this task with an 80/20 train-test split. We use BERT with the AdamW optimizer with the bert-base-cased tokenizer (batch size () of , Learning Rate being 0.0005) to conduct the following experiments for this task.

2.1.1 Carbon emissions and energy consumed

The aim of this experiment is to analyse any possible association between different levels of privacy and carbon emissions. We run these experiments for 10 epochs each and present our results in Table 1 (averaged over 3 runs). Curiously, the carbon emissions for the case are comparable to the EU’s 2021 passenger vehicle standard (Bandivadekar, 2013). The difference between a private model () and a non-private model () is approximately 1g of CO2, which is equivalent to the emissions from five Google searches (Sterling, 2009).

Epsilon ($\epsilon$)   CE (g)         EC (Wh)        Accuracy (%)
0.5             26.7 ± 0.63    49.9 ± 1.2     48.5 ± 1.39
2               26.3 ± 0.49    49.3 ± 0.9     52.0 ± 0.73
5               26.1 ± 0.1     48.9 ± 0.9     52.3 ± 0.36
15              25.9 ± 0.09    48.5 ± 0.1     54.2 ± 1.40
(Non-Private)   25.2 ± 0.00    47.1 ± 0.27    58.5 ± 5.29
Table 1: DP-BERT: Emission-Accuracy trends over change in $\epsilon$ for reaching 52% accuracy.

In congruence with existing literature, the accuracy of the differentially private BERT increases consistently with the increase in epsilon. Interestingly, as the epsilon value increases, both CE and EC decrease, though not by a very significant margin. Although the chosen $\epsilon$ values span a wide range, the resulting carbon emissions do not vary proportionally. The practical implication of this invariance is that two versions of a model with different degrees of privacy incur nearly the same carbon footprint when trained for the same number of epochs.
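The size of this margin can be checked directly from the mean CE values reported in Table 1. The short sketch below computes the relative emission overhead of each private setting against the non-private run; the variable and function names are chosen for this example only, and the standard deviations from the table are ignored.

```python
# Mean carbon emissions (g) from Table 1, fixed 10-epoch runs.
table1_ce_g = {0.5: 26.7, 2: 26.3, 5: 26.1, 15: 25.9}
non_private_ce_g = 25.2


def relative_increase(value: float, baseline: float) -> float:
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline


overheads = {
    eps: relative_increase(ce, non_private_ce_g) for eps, ce in table1_ce_g.items()
}
for eps, pct in overheads.items():
    print(f"epsilon={eps}: +{pct:.2f}% CE vs. non-private")
```

On these means, the overhead stays below 6% across a 30-fold change in epsilon, which is the near-invariance discussed above.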

Epsilon ($\epsilon$)   Epochs   CE (g)   EC (Wh)
0.5             19       153.6    287.3
2               12       96.6     180.6
5               9        80.9     151.3
15              7        56.9     106.5
(Non-Private)   6        8.5      16
Table 2: Observing the number of epochs needed to achieve the threshold test accuracy with different privacy levels

2.1.2 Resource Expense Analysis

The main aim of this experiment is to evaluate how many resources, in terms of the consequent carbon emissions and energy consumption, are expended in order to achieve a target or threshold accuracy with different degrees of privacy. As defined in the previous set of experiments, we compute the accuracies over $\epsilon \in \{0.5, 2, 5, 15, \infty\}$. We set the target/threshold accuracy to 52%, as shown in Table 2.

It can be inferred from Table 2 that the carbon emissions and energy usage required to attain the maximum experimental level of privacy are nearly 18 times those required to attain the same threshold accuracy with a non-privacy-preserving variant of the model. The practical consequence is that enhancing the degree of privacy of the model can incur a huge compute cost, which in turn increases the carbon footprint of the model’s training pipeline.
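The ratio quoted above follows from the values in Table 2, as the sketch below shows. It also derives two quantities that are not reported in the paper and are computed here purely as a consistency check: the emissions per epoch, and the grams of CO2 per Wh implied by each row. The dictionary layout and names are assumptions for this example.

```python
# Table 2 rows: label -> (epochs, CE in g, EC in Wh).
table2 = {
    "0.5": (19, 153.6, 287.3),
    "2": (12, 96.6, 180.6),
    "5": (9, 80.9, 151.3),
    "15": (7, 56.9, 106.5),
    "non-private": (6, 8.5, 16.0),
}

baseline_epochs, baseline_ce, _ = table2["non-private"]

# Emissions relative to the non-private run.
ce_ratio = {k: ce / baseline_ce for k, (_, ce, _) in table2.items()}
# Epoch count relative to the non-private run.
epoch_ratio = {k: ep / baseline_epochs for k, (ep, _, _) in table2.items()}
# Grams of CO2 per Wh implied by each row.
intensity = {k: ce / ec for k, (_, ce, ec) in table2.items()}

for k in table2:
    print(f"{k:>12}: CE x{ce_ratio[k]:.2f}, epochs x{epoch_ratio[k]:.2f}, "
          f"{intensity[k]:.3f} g/Wh")
```

For the strongest privacy setting the epoch count grows by roughly 3.2 times while emissions grow by roughly 18 times, so on these numbers the per-epoch cost of private training is also higher, not only the number of epochs.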

(a) BERT: Training Accuracy
(b) BERT: Testing Accuracy
Figure 1: BERT with Gaussian DP: Training and Testing accuracy trends over change in $\epsilon$, where the threshold accuracy is set to 52%.

Additionally, from Figure 1, which presents the accuracy curves for the experiment, it is evident that the vanilla variant (i.e., a model without DP noise) achieves the threshold accuracy with a significantly smaller carbon footprint than any of its privacy-preserving variants.

2.2 MNIST & CIFAR

Epsilon ($\epsilon$)   CE (g)            EC (Wh)
0.5*            10.53 ± 2.21      40.41 ± 0.93
2*              10.6 ± 2.43       40.5 ± 0.53
5               7.85 ± 1.84       29.93 ± 0.46
15              1.61 ± 0.37       6.17 ± 0.27
(Non-Private)   0.08 ± 7e-04      0.38 ± 3.3e-03
Table 3: MNIST: Emission trends over change in $\epsilon$ for reaching 70% accuracy (* 70% accuracy not reached even after 200 epochs.)

Epsilon ($\epsilon$)   CE (g)   EC (Wh)
0.5*            15.34    70.216
2               12.48    57.108
5               3.12     14.307
15              2.03     9.297
(Non-Private)   0.36     1.678
Table 4: CIFAR: Emission trends over change in $\epsilon$ for reaching 55% accuracy in a single run (* 55% accuracy not reached even after 30 epochs.)

We evaluate our approach on the MNIST dataset (LeCun and Cortes, 2010) with a batch size of 128 using DP-SGD (Abadi et al., 2016). We use a simple multi-layer perceptron (MNIST 2NN) with two hidden layers of 200 units each (parameters = 199,210) as the network, from (McMahan et al., 2017). Our goal is to observe the trend in the CO2 emissions by allowing the model to train and reach 70% accuracy with different values of $\epsilon$ (different levels of privacy). We compute the accuracies over $\epsilon \in \{0.5, 2, 5, 15, \infty\}$ as shown in Fig. 2. We set the target/threshold accuracy to 70% so that most of the privacy-variant models can achieve it in under 200 iterations. In Fig. 2 we see that only models with $\epsilon \geq 5$ reach 70% accuracy within 200 epochs. Fig. 2 shows a clear trend: increasing levels of privacy in ML models increase the amount of computation required to reach the threshold, thereby releasing higher carbon emissions.
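The train-until-threshold protocol described above can be sketched as a small loop. The helper `epochs_to_threshold` and its callback interface are hypothetical; the accuracy sequences in the usage lines are made-up values used only to exercise the function, not results from the paper.

```python
from typing import Callable, Optional


def epochs_to_threshold(
    train_one_epoch: Callable[[], float],
    threshold: float,
    max_epochs: int,
) -> Optional[int]:
    """Return the first 1-based epoch whose accuracy meets the threshold.

    train_one_epoch trains for one epoch and returns the resulting test
    accuracy. Returns None if the threshold is not met within max_epochs,
    which corresponds to the starred rows in Tables 3 and 4.
    """
    for epoch in range(1, max_epochs + 1):
        accuracy = train_one_epoch()
        if accuracy >= threshold:
            return epoch
    return None


# Illustrative accuracy curves (invented values).
fast = iter([0.40, 0.65, 0.72, 0.80])
slow = iter([0.10, 0.12, 0.15, 0.16, 0.18])
print(epochs_to_threshold(lambda: next(fast), threshold=0.70, max_epochs=200))
print(epochs_to_threshold(lambda: next(slow), threshold=0.70, max_epochs=5))
```

Because a run that never reaches the threshold still consumes the full epoch budget, its emissions are a lower bound on the cost of that privacy level rather than a like-for-like measurement.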

For the CIFAR-10 experiments, we use the ResNet18 model pre-trained on the ImageNet dataset. This deep CNN is chosen instead of the simple MNIST 2NN in order to capture more complex and realistic features. We compute the accuracies over $\epsilon \in \{0.5, 2, 5, 15, \infty\}$ as shown in Fig. 3. We set the target/threshold accuracy to 55%, as most of the private models can achieve it in under 30 iterations. We run the CIFAR experiments for only 30 iterations, as we observe little to no significant improvement beyond this. In Fig. 3, we observe trends similar to Fig. 2: increasing privacy levels require more computation to reach the threshold, which in turn releases higher carbon emissions.

(a) MNIST: Accuracy on Training Set
(b) MNIST: Accuracy on Test Set
Figure 2: MNIST with Gaussian DP: Training and test accuracy trends during training for multiple $\epsilon$ values.
(a) CIFAR-10: Accuracy on Training Set
(b) CIFAR-10: Accuracy on Test Set
Figure 3: CIFAR-10 with Gaussian DP: Training and test accuracy trends during training for multiple $\epsilon$ values.

2.3 CartPole

For the reinforcement learning experiments, we trained a DQN on OpenAI Gym’s Cartpole-v0 environment. Due to page restrictions, we defer the discussion of the RL experiments to Appendix B.

3 Conclusion

We demonstrate and highlight the prominent impact of Privacy-Preserving Machine Learning (PPML) on carbon emissions over three ML domains, namely, CV, NLP and RL. We observe that under a stronger privacy regime, i.e., a lower $\epsilon$ value, ML algorithms always result in higher levels of carbon emissions in the CV and NLP domains. Curiously, results for RL are less obviously explained and we defer these discussions to future work. We conclude that alongside the challenge of obtaining state-of-the-art performance, PPML needs to reduce the number of epochs required to reach the desired performance. This leads us to the following critical questions, which we leave open for the future: (1) Can we reduce the number of iterations (including hyperparameter tuning) required to reach a given privacy-utility ratio? (2) How much does the size of ML models affect the carbon emissions and the overall performance under PPML?

4 Acknowledgements

We thank Fatemehsadat Mireshghallah for her valuable inputs and discussions.

References

  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ISBN 9781450341394.
  • A. Anand (2020) AG news classification.
  • A. Bandivadekar (2013) One (vehicle efficiency) table to rule them all. https://theicct.org/blogs/staff/one-vehicle-efficiency-table-rule-them-all. Accessed: 2021-5-31.
  • A. G. Barto, R. S. Sutton, and C. W. Anderson (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5), pp. 834–846.
  • K. Chaudhuri, C. Monteleoni, and A. D. Sarwate (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (3).
  • G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang (2018) Privacy at scale: local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pp. 1655–1658.
  • W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos (2017) Distributional reinforcement learning with quantile regression. arXiv:1710.10044.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9 (3–4), pp. 211–407. ISSN 1551-305X.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
  • A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres (2019) Quantifying the carbon emissions of machine learning. Workshop on Tackling Climate Change with Machine Learning at NeurIPS 2019.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
  • K. Lottick, S. Susai, S. A. Friedler, and J. P. Wilson (2019) Energy usage reports: environmental awareness as part of algorithmic accountability. Workshop on Tackling Climate Change with Machine Learning at NeurIPS 2019.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. arXiv:1602.05629.
  • X. Qiu, T. Parcollet, J. Fernández-Marqués, P. P. B. de Gusmão, D. J. Beutel, T. Topal, A. Mathur, and N. D. Lane (2021) A first look into the carbon footprint of federated learning. CoRR abs/2102.07627.
  • V. Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, and S. Luccioni (2021) CodeCarbon: Estimate and Track Carbon Emissions from Machine Learning Computing. https://github.com/mlco2/codecarbon.
  • A. Skendžić, B. Kovačić, and E. Tijan (2018) General data protection regulation — protection of personal data in an organisation. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1370–1375.
  • G. Sterling (2009) Calculating the carbon footprint of a google search.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. CoRR abs/1906.02243.
  • R. Toews (2020) Deep learning’s carbon emissions problem. Forbes.
  • B. Wang and N. Hegde (2019) Privacy-preserving q-learning with functional noise in continuous state spaces. arXiv:1901.10634.

Appendix A Hyperparameter Tuning

A.1 BERT

We set out to understand the impact of the noise introduced by DP on model convergence up to a threshold test accuracy. We primarily tune the epsilon values (for the privacy guarantee) and the number of epochs. For a fair comparison, we use constant hyperparameters for each epsilon so as to minimize the scope of confounding variables (optimally tuned hyperparameters per epsilon value) for BERT (Table 1).

Beyond that, since many differentially private models are deployed in edge-focused setups where numerous clients can make hyperparameter optimization decisions locally, we conducted the experiments with the threshold accuracy (Table 2). As an ablation, a subset of these locally available hyperparameters was chosen to evaluate the change in carbon emissions and energy consumption when they are altered: the privacy guarantee (meant to be a globally decided parameter in a distributed setup) and the number of epochs (meant to be a locally decided parameter).

A.2 MNIST & CIFAR

Similar to the BERT approach, we use consistent hyperparameters across different $\epsilon$-privacy levels for the MNIST dataset.

However, due to serious model performance degradation on the CIFAR-10 dataset, we use an alternate approach. We tweak a set of hyperparameters to achieve optimal performance at the respective $\epsilon$-privacy levels when the model performs poorly. For instance, we observe a major drop in performance for only the non-private model when a single constant learning rate is used for both the private and non-private settings. So, we instead choose a separate learning rate for the non-private model, which leads to optimal performance.

In our experiments, we also use the RMS-PROP optimizer which leads to more stable results & faster convergence during training.

Appendix B Reinforcement Learning: Experiments & Discussion

B.1 CartPole

The Cartpole-v0 environment (Barto et al., 1983) consists of a pole attached by an un-actuated joint to a cart. There are two possible actions, which involve a force of +1 or -1 being applied to the cart along a frictionless track. The pole starts upright, with the goal of preventing it from falling over. For every time-step that the pole is upright, a reward of +1 is added to the total reward. However, if the pole exceeds 15 degrees from the vertical, or if the cart moves more than 2.4 units from the center, the episode ends.
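The reward and termination rules just described can be written out as a short function. This is a simplified sketch of the rules as stated in the text (15 degrees and 2.4 units), not the Gym implementation; the function name `episode_reward` and the state tuples in the usage lines are invented for illustration.

```python
def episode_reward(states, max_angle_deg: float = 15.0, max_position: float = 2.4) -> int:
    """Total reward for a sequence of (cart_position, pole_angle_deg) states.

    Each state inside the allowed region earns +1. The episode ends at the
    first state where the pole angle or cart position exceeds its limit.
    """
    total = 0
    for position, angle_deg in states:
        if abs(angle_deg) > max_angle_deg or abs(position) > max_position:
            break
        total += 1
    return total


# The pole tips past 15 degrees on the third state, so the episode ends there.
trajectory = [(0.0, 0.0), (0.1, 5.0), (0.2, 16.0), (0.0, 0.0)]
print(episode_reward(trajectory))
```

Under these rules the total reward equals the number of time-steps survived, which is why the mean rewards in the tables below can be read as average episode lengths.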


Figure 4: CartPole with Gaussian DP: Episodes vs Rewards for the mean reward every 100 episodes

Figure 5: Acrobot with Gaussian DP: CO2 vs Epsilon values post training after a 1000 episodes

Figure 6: Acrobot with Gaussian DP: Episodes vs Rewards for the mean reward every episode for a 1000 episodes

The DQN’s configuration (including the hyperparameters) is the same as the one used in (Wang and Hegde, 2019), and we observed results similar to that paper, with one variant of the DP model slightly outperforming the baseline, as shown in Fig. 4. The network consists of a single hidden layer with 16 neurons. For our non-private experiment we obtained a mean reward of 19.94 and carbon emissions of 0.22 g on average (over 1000 episodes). We provide results of the private variants in Fig. 4. Our setup included multiple experiments:

  • Noise addition to DQN’s output layer only. (1)

  • Noise addition to both, the DQN’s output layer and its parameters. The noise added to the parameters is the averaged noise sampled from the noisebuffer function during the forward pass. (2)
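To give a feel for setup (1), the sketch below perturbs a vector of Q-values with Gaussian noise and measures how often the greedy action changes. It is a deliberate simplification: it draws independent noise per action, whereas Wang and Hegde (2019) use functional noise, and all names and Q-values here are hypothetical.

```python
import random


def noisy_q_values(q_values, sigma, rng):
    """Add independent zero-mean Gaussian noise to each Q-value."""
    return [q + rng.gauss(0.0, sigma) for q in q_values]


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def flip_rate(q_values, sigma, trials=2000, seed=0):
    """Fraction of trials in which noise changes the greedy action."""
    rng = random.Random(seed)
    clean = greedy_action(q_values)
    flips = sum(
        greedy_action(noisy_q_values(q_values, sigma, rng)) != clean
        for _ in range(trials)
    )
    return flips / trials


q = [1.0, 0.0]  # made-up Q-values for the two CartPole actions
for sigma in (0.5, 2.0, 5.0, 15.0):
    print(f"sigma={sigma}: greedy action flipped in {flip_rate(q, sigma):.1%} of trials")
```

As the noise scale grows, the greedy action is overridden more often, which is one plausible reading of why mean reward tends to drop at larger noise levels.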

We varied the value of the variance of the noise distribution to observe its impact on the function approximated by the DQN. As expected, with increasing noise addition to the model (i.e., increasing value of $\sigma$), we notice a drop in the average reward. Subsequently, the increased computations lead to higher carbon emissions. We observe a significant increase in CE from Table 6 to Table 7, when the number of episodes increases.

B.2 Acrobot

We alternatively run experiments on the Acrobot-v1 environment. Acrobot-v1 has 2 joints and links and the joint between the links is actuated. The environment’s initial position has the links in a resting state, hanging downward, with the objective being to swing the end of the lowermost link up to a specified height. There are 3 potential actions, to apply positive torque, to apply negative torque, or to do nothing. The agent attempts to maximize its reward within 500 time-steps, with each time-step potentially incurring a punishment of -1, thus making -500 the worst possible reward. If the agent reaches the given height, a reward of 0 is returned. The state returned consists of the joint angular velocities and the sin and cos values of the two rotational joint angles.

For the purpose of training on the Acrobot environment, we used distributional reinforcement learning with quantile regression (Dabney et al., 2017). Quantile regression models the distribution of the returned rewards instead of only considering the mean of the distribution. We use the Opacus library in PyTorch to add the necessary DP guarantees. From Fig. 5, we observe a consistent increase in CO2 emissions as the epsilon value decreases. This directly implies that better privacy guarantees lead to higher carbon emissions. As a sanity check, we can observe (from Table 8) that the mean reward increases with higher epsilons (lower privacy requirement) due to reduced noise levels.

Epsilon ($\epsilon$)   Sigma ($\sigma$)   Mean Reward   CE (g)
1      15     4.5 ± 0.6     1.03 ± 0.06
3      5      2.2 ± 0.2     0.96 ± 0.03
7.5    2      19.9 ± 0.5    1.14 ± 0.06
30     0.5    19.4 ± 0.1    1.15 ± 0.06
Table 5: CartPole: Emission trends over change in $\epsilon$ post 1000 episodes in (1) following (Wang and Hegde, 2019)

Epsilon ($\epsilon$)   Sigma ($\sigma$)   Mean Reward   CE (g)
1      15     2.3 ± 0.9     0.41 ± 0.01
3      5      10.2 ± 0.8    0.5 ± 0.02
7.5    2      7.6 ± 0.7     0.45 ± 0.02
30     0.5    13.8 ± 0.1    0.48 ± 0.03
Table 6: CartPole: Emission trends post 1000 episodes in (2)

Epsilon ($\epsilon$)   Sigma ($\sigma$)   Mean Reward   CE (g)
1      15     13.2 ± 0.3    3.51 ± 0.26
3      5      13.7 ± 0.9    2.31 ± 0.28
7.5    2      18.1 ± 0.1    2.72 ± 0.23
30     0.5    19.8 ± 0.6    4.0 ± 0.31
Table 7: CartPole: Emission trends post 5000 episodes in (2)

Epsilon   Mean Reward       CO2
0.5       -150.7 ± 18.69    3.25 ± 0.372
2         -138.0 ± 11.15    3.0 ± 0.245
5         -120.6 ± 4.64     0.98 ± 0.035
15        -117.5 ± 2.45     0.92 ± 0.023
0         -110.5 ± 1.22     0.33 ± 0.023
Table 8: Acrobot-v1: Emission trends post 1000 episodes