1. Introduction
Deep neural networks (DNNs) (Goodfellow et al., 2016) can learn highly useful patterns from large multidimensional datasets, enabling impactful applications, e.g., in health (Ziller et al., 2021; Malekzadeh et al., 2021). However, large amounts of training data are required not only for learning near-optimal DNN parameters for the underlying task, but also for finding the right set of hyperparameters that enable effective learning. For a task defined on public datasets, the same data can be reused as many times as we wish. But, as every reuse of the available data comes at the price of some privacy loss, hyperparameter tuning has been a fundamental challenge for tasks defined on private datasets.

Differential Privacy (DP) (Dwork et al., 2006) provides strong guarantees for the individuals participating in private datasets. DP restricts the maximum contribution of each sample to the result of a computation on the private dataset. Differentially-private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) is a widely accepted algorithm for training DNNs on private datasets: at each iteration, zero-mean Gaussian noise with a predefined variance is added to the clipped gradients computed for each sample in the training dataset. Noisy gradients often result in degraded accuracy for the trained DNN.
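To make the clip-and-noise step concrete, here is a minimal sketch in plain Python. The function name and flat-list gradient representation are our own illustrative choices (real DP-SGD implementations operate on per-example gradient tensors), but the logic — per-sample L2 clipping, summing, adding Gaussian noise scaled by the noise multiplier, and averaging — follows the description above:

```python
import math
import random

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, lr, params):
    """One illustrative DP-SGD update: clip each per-sample gradient to
    L2 norm `clip_norm`, sum, add N(0, (noise_multiplier * clip_norm)^2)
    noise, average, and take a gradient step."""
    d = len(params)
    summed = [0.0] * d
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip, never amplify
        for i in range(d):
            summed[i] += g[i] * scale
    sigma = noise_multiplier * clip_norm
    noisy_avg = [(s + random.gauss(0.0, sigma)) / len(per_sample_grads)
                 for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_avg)]
```

Note how the noise multiplier directly scales the standard deviation of the injected noise: larger values yield stronger privacy but noisier updates.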
Previous works look at two variants: (1) optimizing the privacy parameters of a private model to achieve performance comparable to a non-private model, and (2) providing privacy guarantees while reaching moderate performance (van der Veen et al., 2018). However, in practice, both hyperparameters and privacy parameters need to be optimized within the user-specified privacy budget. Thus, in this paper, we propose a systematic study of learning hyperparameters faster (constrained by a privacy budget) and with less privacy cost through four different optimization algorithms.
Although there is a wide range of hyperparameters that one can choose in DP-SGD (e.g., noise multiplier, clipping factor, batch size, learning rate, etc.), in this paper we specifically focus on two important hyperparameters: the noise multiplier σ (i.e., the standard deviation of the Gaussian noise) and the learning rate η. We optimize for these two parameters specifically because the privacy leakage ε and the validation loss are highly dependent on the chosen values for σ and η, respectively. To this end, we study three cost-effective algorithms: evolutionary optimization, Bayesian optimization, and reinforcement learning, and compare the results with a grid search baseline. We show consistent results across these techniques and provide insights on which algorithm could pave the path towards better hyperparameter tuning in DP-SGD. We also open-source our code to enable future practitioners to optimize for specific hyperparameters according to their requirements.

2. Related Work
The most widely used methods for hyperparameter tuning in deep learning are manual search, random search, and grid search (Tramer and Boneh, 2021). Manual search refers to the manual tuning of hyperparameters by individuals experimenting on a deep neural network; it is frequently used because it leverages prior experience and intuition. Random search, on the other hand, provides a path towards hyperparameter space exploration; however, it is non-exhaustive and may fail to discover high-performing hyperparameters. Grid search is therefore used to provide sufficient exploration within a restricted search space. However, due to its non-adaptive nature (i.e., the hyperparameter sets selected for evaluation are not chosen using already-available results), it consumes abundant resources and requires significant computational time.
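For reference, the exhaustive grid search baseline is only a few lines of code. This is a hypothetical sketch: `train_eval` stands in for a full DP-SGD training run returning a scalar reward, and its exhaustiveness is exactly what makes the method costly:

```python
from itertools import product

def grid_search(train_eval, sigmas, lrs):
    """Exhaustively evaluate every (noise multiplier, learning rate) pair
    and return the best-scoring (reward, sigma, lr) triple. `train_eval`
    is assumed to train a model and return a scalar reward."""
    best = None
    for sigma, lr in product(sigmas, lrs):
        reward = train_eval(sigma, lr)
        if best is None or reward > best[0]:
            best = (reward, sigma, lr)
    return best
```

Every grid point triggers a full training run on the private data, which is why the non-adaptive cost (and the associated privacy loss) grows multiplicatively with the grid resolution.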
Not all common practices in training deep models are directly applicable when we apply DP-SGD. For instance, (Papernot et al., 2021) shows that while ReLU is the most common activation function in conventional deep learning, for DP-SGD a bounded activation function, such as tanh or a tempered sigmoid, is more effective. Also, (van der Veen et al., 2018) argues that while SGD hyperparameters and DP parameters are treated as independent in the original DP-SGD paper (Abadi et al., 2016), this is not a reasonable assumption. For example, by choosing a smaller batch size, we can achieve better privacy, but to maintain accuracy we then need to reduce the learning rate accordingly. However, smaller learning rates usually slow down convergence (Bengio, 2012), so we need more epochs and hence incur more privacy loss in DP-SGD. Therefore,
(van der Veen et al., 2018) suggests first using a public dataset to find an appropriate DNN architecture with an optimized set of hyperparameters, and then training the model on the private dataset. However, (van der Veen et al., 2018) does not propose any method for searching over DP-SGD hyperparameters.

Model selection in multivariate linear regression under the constraint of differential privacy, based on penalized least squares and likelihood, is studied in (Lei et al., 2018). In particular, (Lei et al., 2018) reports that under differential privacy, the model selection procedure becomes more sensitive to the tuning parameters. Moreover, the appropriate choice of tuning parameters requires some additional information in the data, and developing differentially private methods to estimate these hyperparameters is mentioned as a future topic in (Lei et al., 2018).
(Pichapati et al., 2019) introduces AdaCliP, which achieves the same privacy guarantee with much less added noise by using coordinate-wise adaptive clipping of the gradient. As the convergence of DP-SGD depends on the variance of the gradient, AdaCliP also improves the accuracy of the trained model. While such adaptive clipping provides better tradeoffs, it needs to estimate the variance and thus introduces four new hyperparameters of its own, making our problem more complicated. Similarly, (Andrew et al., 2019) introduces a method for adaptively tuning the clipping threshold to track a given "quantile" of the update-norm distribution during training. Again, this method also needs to tune a new hyperparameter of its own.

The DPareto algorithm is proposed in (Avent et al., 2020), where Bayesian optimization is used for hyperparameter tuning. The paper describes the Pareto front and empirically validates its application using different neural network architectures across two datasets. That study uses a multi-objective Bayesian optimizer to find the best hyperparameters, utilizing the hypervolume to weigh the relative merit of different objectives. In contrast, we use a single-objective Bayesian optimizer, which reduces computational cost and directly optimizes the reward function we define.
(Yazdanbakhsh et al., 2018) uses reinforcement learning to efficiently tune the hyperparameters needed for quantization of deep neural networks, finding the bit-widths for the weights of each layer that provide an optimal computation-accuracy tradeoff.
3. Methodologies
3.1. Problem Formulation
We consider the problem of training a DNN with a fixed architecture (i.e., the number, type, and size of each layer) using DP-SGD. Let D_train denote the training set and D_val denote the validation set. Let H denote the set of hyperparameters that are used during training on D_train and that impact both the validation loss (L_val) and the DP privacy loss (ε), measured on D_val. To provide a general but customizable framework, we define the reward as a weighted linear combination of L_val and ε:
R = −(α · L_val + β · ε)     (1)
We use regularizers α and β to control the importance of utility and privacy, respectively (i.e., to control the privacy-utility tradeoff). In our proposed framework, we first set these regularizers and then search for the optimal hyperparameters in H using the algorithms explained in the following section. In this paper, we consider H = {σ, η}, where σ denotes the noise multiplier and η denotes the learning rate in DP-SGD. Our aim in the following experiments is to optimize the reward given by Equation (1). Notice that, in practice, the values of α and β depend on the requirements of the underlying task.
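A single-objective reward of this shape is a one-liner. The sketch below assumes the weighted combination is simply negated so that a higher reward means both lower validation loss and lower privacy leakage; the exact sign convention is our reconstruction, not taken verbatim from the paper:

```python
def reward(val_loss, epsilon, alpha=1.0, beta=1.0):
    """Single-objective reward: a weighted linear combination of the
    validation loss and the privacy loss epsilon, negated so that
    maximizing the reward minimizes both terms. alpha weights utility,
    beta weights privacy."""
    return -(alpha * val_loss + beta * epsilon)
```

Setting beta = 0 recovers pure utility optimization, while a large beta makes the search strongly privacy-seeking; all four optimizers in the next section treat this scalar as their fitness signal.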
3.2. Evolutionary optimization
Evolutionary optimization algorithms provide an opportunity to both explore and exploit the hyperparameter search space (Bochinski et al., 2017; Young et al., 2015). Their random initialization and mutation operators give them the advantages of random search, while their adaptive nature enables them to exploit promising values that give better results. In our implementation, we encode each hyperparameter as a gene, and a set of hyperparameters makes up the genome of an individual, i.e., one experiment. The range and precision of each hyperparameter are predetermined, allowing the algorithm to search within a limited space. The initial population is formed by randomly sampling hyperparameters from this space. Once assembled, the population is trained, and a fitness score, or reward, is measured using Equation (1). Each subsequent generation is then formed through selection, crossover, and mutation based on the individuals with the highest fitness in the previous generation. The methodology can be tuned for high exploitation, thereby reducing resource wastage (Young et al., 2015).
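The select-crossover-mutate loop described above can be sketched as follows. This is a minimal toy version under our own assumptions (elitist selection, one-point crossover, Gaussian mutation clamped to the search bounds), not the exact operators used in the experiments:

```python
import random

def evolve(fitness, bounds, pop_size=8, generations=5, elite=2, mut_std=0.1):
    """Minimal evolutionary search over a genome such as (sigma, eta).
    `fitness` maps a genome to a reward; `bounds` gives (low, high)
    per gene."""
    def sample():
        return [random.uniform(lo, hi) for lo, hi in bounds]
    def mutate(g):
        # Gaussian perturbation, clamped back into the allowed range
        return [min(hi, max(lo, x + random.gauss(0.0, mut_std)))
                for x, (lo, hi) in zip(g, bounds)]
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:elite]                 # elitist selection
        children = []
        while len(children) < pop_size - elite:
            a, b = random.sample(parents, 2)
            cut = random.randrange(len(bounds))  # one-point crossover
            children.append(mutate(a[:cut] + b[cut:]))
        pop = parents + children
    return max(pop, key=fitness)
```

In the real setting, each `fitness` call is a full DP-SGD training run, so the population size and generation count directly control the total privacy cost of tuning.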
Table 1. Computational time, best reward, and the corresponding accuracy and epsilon for each method on CIFAR-10 and MNIST.

Method        | Time (hours) | Best Reward (%) | Accuracy (%) | Epsilon (ε)
(A) CIFAR-10
Grid Search   | 150.020      | 51.406          | 44.936       | 0.600
Evolutionary  | 11.064       | 52.044          | 37.999       | 0.599
Bayesian      | 49.636       | 51.846          | 43.864       | 0.581
Reinforcement | 52.971       | 52.398          | 44.884       | 0.590
(B) MNIST
Grid Search   | 43.712       | 72.260          | 89.133       | 0.683
Evolutionary  | 5.250        | 72.615          | 73.745       | 0.175
Bayesian      | 2.853        | 73.385          | 81.562       | 0.349
Reinforcement | 31.165       | 74.906          | 75.022       | 0.240
3.3. Bayesian Optimization
Bayesian optimization treats neural network training and performance as a black-box function. It combines prior beliefs about this black-box function with sample information to approximate the function's distribution via the Bayesian formula (Wu et al., 2019). Based on this estimated distribution, promising values can be extrapolated. The estimated distribution is effectively a probabilistic model of the function, which is exploited to decide where to evaluate next while integrating out uncertainty (Snoek et al., 2012). This methodology makes it possible to find the minima of complex non-convex functions with relatively few evaluations, at the cost of assuming the function is drawn from a Gaussian process prior. Our experiments use Hyperopt, a Sequential Model-Based Optimization (SMBO) library that provides high performance at a low computational budget (Bergstra et al.).
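The shape of the SMBO loop — fit a cheap surrogate to the observations so far, then spend the next expensive evaluation where the surrogate looks best — can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-neighbour surrogate purely for illustration; Hyperopt's actual TPE surrogate is far more sophisticated, and none of the names here come from the Hyperopt API:

```python
import random

def smbo(objective, bounds, n_init=5, n_iter=20, n_candidates=50):
    """Toy sequential model-based optimization: fit a 1-nearest-neighbour
    surrogate to the observed (point, reward) pairs, then evaluate the
    candidate the surrogate ranks highest. Returns the best observation."""
    def sample():
        return tuple(random.uniform(lo, hi) for lo, hi in bounds)
    # Initial design: a handful of random evaluations
    history = [(p, objective(p)) for p in (sample() for _ in range(n_init))]
    def surrogate(p):
        # Predicted reward = reward of the closest observed point
        closest = min(history,
                      key=lambda h: sum((a - b) ** 2 for a, b in zip(h[0], p)))
        return closest[1]
    for _ in range(n_iter):
        cand = max((sample() for _ in range(n_candidates)), key=surrogate)
        history.append((cand, objective(cand)))
    return max(history, key=lambda h: h[1])
```

The key property shared with the real method is that candidate ranking is nearly free, so the expensive `objective` (a DP-SGD training run, in our setting) is only called where the model of past results suggests it is worthwhile.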
3.4. Reinforcement Learning
Evolutionary optimization and Bayesian optimization both provide adequate methodologies for search-space exploration and exploitation. However, this classical problem can also be addressed with reinforcement learning. In our application of this method, we begin by initializing a regression network capable of estimating the reward of training with a particular set of hyperparameters. We first sample a random collection of hyperparameters, use them to train the DP-SGD model, and obtain the rewards used to fit the regression network. We then extract the estimated reward over the entire search space, and the best-performing hyperparameters are obtained from this estimate. These hyperparameters are mutated to hyperparameters in their near proximity for the next episode, allowing the model to exploit values that may give high performance. In the following episodes, a certain percentage of experiments is selected based on the reward estimates of the regression network, while the rest continue to be randomly sampled. The split is determined by an epsilon-decreasing strategy, where the value of the exploration-exploitation epsilon decreases as the experiment progresses. This methodology allows us to estimate the hyperparameter-reward function and probe the proximal search space of high-performing hyperparameters, giving us generalized results.
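The episode structure — explore at random with probability eps, otherwise exploit near the current best estimate, with eps decaying over episodes — can be sketched as below. For brevity the surrogate here is a plain lookup table of observed rewards over a discretized search space, standing in for the regression network described above; all names are illustrative:

```python
import random

def rl_search(objective, space, episodes=10, batch=6, eps0=1.0, decay=0.8):
    """Epsilon-decreasing hyperparameter search over a discrete `space`
    of candidate settings. `observed` (a table of rewards) stands in for
    the regression network's reward estimate."""
    observed = {}                       # point -> observed reward
    eps = eps0
    for _ in range(episodes):
        if observed and random.random() > eps:
            # Exploit: mutate around the index of the current best point
            best = max(observed, key=observed.get)
            i = space.index(best)
            picks = [space[max(0, min(len(space) - 1, i + d))]
                     for d in random.choices([-1, 0, 1], k=batch)]
        else:
            picks = random.choices(space, k=batch)   # explore
        for p in picks:
            observed[p] = objective(p)
        eps *= decay                    # epsilon-decreasing schedule
    return max(observed.items(), key=lambda kv: kv[1])
```

With a fast decay the search quickly locks onto one region (high exploitation, low cost); with a slow decay it keeps sampling broadly, which matches the variance in time-to-reward we observe for this method in Section 4.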
4. Evaluation
For the experiments in this section, we used a Tesla P100 GPU with 16 GB of memory and an Intel Xeon CPU with 13 GB of RAM. Note that the random seed is fixed across all experiments for uniformity and reproducibility. In the rest of this section, we discuss the benchmarks used and the results of each experiment.
4.1. Benchmarks
To assess and analyze the effectiveness of the optimization algorithms on the CIFAR-10 and MNIST datasets, we use grid search with a similar search complexity as the other methods. We report the computational time taken, the best reward achieved, and the corresponding accuracy and epsilon value in Table 1. Grid search displays a poor understanding of the epsilon-accuracy tradeoff as it is not adaptive in nature. It achieves a reward of 72.2% and 51.4% on the MNIST and CIFAR-10 datasets, respectively.
4.2. Optimization Algorithms
As described in earlier sections, we ran our experiments with three distinct optimization algorithms. We observe that although reinforcement learning provides the highest performance, it comes at the expense of computational time. Evolutionary algorithms and Bayesian optimization, on the other hand, provide consistent results with respect to both computational time and performance.

Additionally, Table 1 shows that although grid search returns a highly accurate model, this comes at the cost of high privacy leakage. In contrast, adaptive optimization algorithms can leverage previous samples to search for a better privacy-utility tradeoff, allowing them to achieve high rewards. Evolutionary optimization achieves the lowest epsilon value on the MNIST dataset, while Bayesian optimization achieves the lowest on the CIFAR-10 dataset.
4.3. Evaluating Computation Time for Satisfactory Reward
Although finding the best reward is our goal, we also evaluate the computational time each algorithm requires to reach the maximum reward attained by grid search. The time consumed is the time taken for an optimization algorithm to achieve a reward equal to or greater than the baseline reward, i.e., the highest reward achieved by the grid search algorithm. We display these results in Figure 3. It is clear that grid search is much more time-consuming than the other optimization algorithms.

From Figure 3, we observe that the Bayesian and evolutionary optimization algorithms display consistency and uniformity across datasets, whereas reinforcement-learning-based optimization fails to generalize the time taken to achieve a given reward. This is due to the exploration-exploitation tradeoff mechanism it applies. An appropriate exploration-exploitation epsilon must be selected if the user intends to have a highly exploitative optimization model.
4.4. Evaluating Sample-Specific Privacy
In this subsection, we look at sample-specific privacy. We count the total number of times each sample is revisited by each algorithm during optimization. As training on any sample can leak sensitive information, we consider this an essential metric for privacy evaluation. In Figure 3, we display a bar graph of the number of times any sample in the training set is visited before each model achieves its respective highest reward. These evaluations again display the impractical nature of grid search and confirm its high privacy leakage. Bayesian and evolutionary optimization, on the other hand, continue to give the expected results relative to grid search. The reinforcement learning approach again displays high variance across datasets, giving it weaker generalization capacity.
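A sample-visit count of this kind is straightforward to simulate. The sketch below assumes uniform random batch sampling and counts how often each training-set index is drawn across all tuning trials; the function name and arguments are our own illustration, not the paper's evaluation code:

```python
import random
from collections import Counter

def count_sample_visits(runs, batches_per_epoch, batch_size, dataset_size):
    """Count how often each training sample is visited across all tuning
    trials, assuming uniform random batch sampling without replacement
    within a batch. `runs` is a list of epoch counts, one per trial."""
    visits = Counter()
    for epochs in runs:
        for _ in range(epochs * batches_per_epoch):
            visits.update(random.sample(range(dataset_size), batch_size))
    return visits
```

The total visit count grows linearly with the number of trials and their epoch counts, which is why exhaustive grid search, with its many long runs, dominates this metric.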
4.5. Learning and Convergence Analysis of Reinforcement Learning Approach
We further study the behavior of the reinforcement learning approach through Figure 4. As the model learns from new data every epoch, we can observe that the expected reward is continually adapted. The model makes the strongest changes to the reward estimates nearest the previous global maximum. This allows the model to exploit better-performing hyperparameter values and restrict its search to a high-performing area. However, it also means that as the number of epochs increases, estimates in regions that remain unexplored do not change in value. Therefore, an experimentally robust exploration-exploitation epsilon must be selected for generalized results.

Comparing the two datasets, we can see the more varied nature of the reward landscape for the MNIST dataset. This can be attributed to the simplicity of the dataset, which allows learning to succeed over many different hyperparameter settings. The CIFAR-10 dataset, however, is much more complex, leading to only the most optimal hyperparameters being highlighted.
5. Conclusion and Future Work
In this paper, we discussed different methodologies for hyperparameter tuning in the private training of deep neural networks with the DP-SGD algorithm. We proposed a novel, customizable reward function that allows users to define a single objective function expressing their desired privacy-utility tradeoff. We quantified, compared, and analyzed grid search (as the baseline), Bayesian optimization, evolutionary optimization, and reinforcement learning across two datasets, CIFAR-10 and MNIST. We observed that Bayesian and evolutionary optimization behave similarly in terms of the privacy-utility tradeoff point they provide and how efficiently they find it. Reinforcement learning provides a more desirable tradeoff, but with varying efficiency across datasets. All three methods perform much better than the baseline grid search algorithm. We believe our work serves as a valuable resource for privacy-preserving ML practitioners, developers, and researchers for hyperparameter tuning.

For future work, one can use our proposed method alongside that of (Lee and Kifer, 2018; Chen and Lee, 2020), where a portion of the privacy budget is allocated to finding the appropriate learning rate on the private dataset. Another direction is to extend our proposed method to tune other hyperparameters in DP-SGD, and even the network architecture and the non-linear activation functions that are used.
Acknowledgement
Mohammad Malekzadeh was partially supported by the UK EPSRC (grant no. EP/T023600/1) within the CHIST-ERA program.
References
Abadi et al. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.

Andrew et al. (2019). Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871.

Avent et al. (2020). Automatic discovery of privacy–utility Pareto fronts. Proceedings on Privacy Enhancing Technologies 2020(4), pp. 5–23.

Bengio (2012). Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pp. 437–478.

Bergstra et al. Hyperopt: a Python library for optimizing the hyperparameters of machine learning algorithms.

Bochinski et al. (2017). Hyperparameter optimization for convolutional neural network committees based on evolutionary algorithms. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3924–3928.

Chen and Lee (2020). Stochastic adaptive line search for differentially private optimization. In 2020 IEEE International Conference on Big Data (Big Data), pp. 1011–1020.

Dwork et al. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.

Goodfellow et al. (2016). Deep Learning. MIT Press.

Lee and Kifer (2018). Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1656–1665.

Lei et al. (2018). Differentially private model selection with penalized and constrained likelihood. Journal of the Royal Statistical Society: Series A (Statistics in Society) 181(3), pp. 609–633.

Malekzadeh et al. (2021). Dopamine: differentially private federated learning on medical data. In The Second AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI-21).

Papernot et al. (2021). Tempered sigmoid activations for deep learning with differential privacy. In Proceedings of the AAAI Conference on Artificial Intelligence.

Pichapati et al. (2019). AdaCliP: adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643.

Snoek et al. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, Vol. 25.

Tramer and Boneh (2021). Differentially private learning needs better features (or much more data). In International Conference on Learning Representations (ICLR).

van der Veen et al. (2018). Three tools for practical differential privacy. arXiv preprint arXiv:1812.02890.

Wu et al. (2019). Hyperparameter optimization for machine learning models based on Bayesian optimization. Journal of Electronic Science and Technology 17(1), pp. 26–40.

Yazdanbakhsh et al. (2018). ReLeQ: an automatic reinforcement learning approach for deep quantization of neural networks. arXiv preprint arXiv:1811.01704.

Young et al. (2015). Optimizing deep learning hyperparameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments (MLHPC '15).

Ziller et al. (2021). Differentially private federated deep learning for multi-site medical image segmentation. arXiv preprint arXiv:2107.02586.