1 Introduction
Deep neural networks (DNNs) have achieved remarkable success in a variety of tasks, including image understanding and natural language processing [dlbook, dlnature]. On the other hand, DNNs may leak sensitive information carried in the training data [mia, inversion, dlg], thereby raising privacy issues. To counter this, learning algorithms that provide principled privacy guarantees in the form of differential privacy (DP) [dp, ethical, dpcomplexity] have been developed. For instance, differentially private stochastic gradient descent (DPSGD) [dpsgd], a generally applicable modification of SGD, is widely adopted in differentially private deep learning (DPDL) applications, ranging from medical image recognition [p3sgd] and image generation [dpgan] to federated learning [fed].
However, training DNNs with strong DP guarantees inevitably degrades model utility [dp]. To improve the utility for meaningful DP guarantees, prior works have proposed a wealth of strategies, e.g., gradient dimension reduction [gep, bgep], adaptive clipping [adaclip1, adaclip2], adaptive budget allocation [adptbudget], and transfer learning with non-sensitive data [scatter, ppsgd]. Whilst all these methods focus on improving training algorithms, the impact of model architectures on the utility of DPDL is so far unexplored. Formally, a learning algorithm $\mathcal{M}$ that trains models from the output set $\mathcal{S}$ is $(\varepsilon, \delta)$-DP if $\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$ holds for every subset $S \subseteq \mathcal{S}$ and for all training sets $D$ and $D'$ that differ by exactly one record. For DPSGD, the output at each step is a high-dimensional gradient vector, which is affected by the model architecture. As a result, using different model architectures does not change the privacy guarantee of DPSGD but can change its output space, thereby affecting the model utility.
To investigate the effect of model architectures, we compare the utility of several existing DNNs by training them with SGD and DPSGD. The comparisons include handcrafted models [res, dense, wrn, ts] and automatically searched models [nasnet, darts, efnet, real19, enas]. Three observations can be summarized from Figure 1. First, for DPDL, the models designed without considering private training (marked by triangles) perform much worse than the specially designed model (marked by a diamond). Second, there is no clear positive correlation between the performance of a model trained with DP and without DP. Third, the model utility obtained with DPSGD varies greatly even when the model sizes are comparable, which differs from the trend seen in conventional deep learning. At a high level, these observations suggest that (1) model architectures can significantly affect the utility of DPDL; and (2) instead of directly using DNNs built for conventional deep learning, it is necessary to redesign models for DPDL to improve the utility. Although model architecture is important for DPDL, there is little experience or common knowledge to draw on when designing DP-friendly models.
In this paper, inspired by the above insights and the advances of neural architecture search (NAS) [nas], we propose DPNAS, the first effort to automatically search models for DPDL. Our motivation is to improve the privacy/utility tradeoff with minimal prior knowledge by integrating private learning with architecture search. To this end, we design a novel search space for DPDL and propose a DP-aware method for training candidate models during the search process. The searched model DPNASNet achieves new SOTA results. In particular, for privacy budget , we obtain test accuracies of , , and on MNIST, FashionMNIST, and CIFAR10, respectively. In ablation studies, we verify the effectiveness of our approach and the merits of the resulting models. DPNAS enables us not only to automatically design better models with little prior knowledge but also to summarize rules about model design for DPDL. We analyze the resulting models and provide several new findings for designing DP-friendly DNNs: (1) SELU [selu] is more suitable for DPDL than Tanh (tempered sigmoid); (2) activation functions that can retain negative values could be more effective for DPDL; (3) max pooling is better than average pooling for DPDL.
Our main contributions are summarized as follows:

We propose DPNAS, the first framework that employs NAS to search models for private deep learning. We introduce a novel search space and propose a DP-aware method to train the candidate models during the search.

Our searched model DPNASNet substantially advances SOTA privacy/utility tradeoffs for private deep learning.

We analyze the resulting model architectures and provide several new findings on model design for deep learning with differential privacy.
2 Related Work
Differentially Private Deep Learning.
Differential privacy (DP) [dp] is a measurable definition of privacy that provides provable guarantees against the identification of individuals in a data set. To address the privacy leakage issue in deep learning, Abadi et al. [dpsgd] propose differentially private stochastic gradient descent (DPSGD), a generally applicable modification of SGD. Despite providing a DP guarantee, DPSGD incurs a significant cost in model utility. To preserve the utility of DPSGD-trained models, a line of work [gep, bgep, adaclip1, adaclip2, adptbudget, ppsgd] improves the DP training algorithm while neglecting the benefit of enlarging the algorithm space by varying model architectures. Recently, Papernot et al. [ts] observe that DPSGD can cause exploding model activations, which degrades model utility. To make models more suitable for DPSGD training, they replace the unbounded ReLUs with bounded Tanhs as the activation function. However, their design is handcrafted and does not involve other architectural factors that may also affect the utility. In contrast, our work explicitly and systematically studies the effect of model architecture on the utility of DPDL and proposes to boost the utility via automatic search that considers both activation function selection and network topology.
Neural Architecture Search (NAS).
NAS is an automatic process for designing model architectures that aims to reduce human effort and improve model utility. A typical NAS process can be described as follows. The search strategy first samples a candidate model from the search space, and the model is trained and evaluated according to the performance estimation. Then, the estimation result is fed back to the search strategy as guidance for selecting better models. Hence, the optimal model can be obtained through successive iterations. Following this paradigm, widely used search strategies have been developed, including reinforcement learning (RL) [nas, nasnet, metaqnn, enas], evolutionary algorithms (EA) [real17, real19], and gradient-based optimization of architectures [darts, mdenas, pnas]. To define a fine-grained search space, previous methods tend to search for the optimal connections and operations at the cell level [nasnet, darts, enas, real19, pnas, mdenas]. In this setting, the overall architecture of a model is constructed from cells that share the same architecture, and the search process attempts to find the optimal internal connections and operations of the cell. The connection topology of the cell can be established based on predefined motifs [hienas] or an automatically searched structure [nasnet]. The operations are normally non-parametric operations and convolutions with different filter sizes. It is worth noting that the choice of activation functions is usually not taken into account in search spaces for non-privacy-preserving deep learning tasks. In contrast to the existing NAS methods, our DPNAS aims at searching high-performance models for private learning, using a different training method for candidate models and a carefully designed search space involving multiple types of activation functions.
3 Methodology
Our goal is to employ NAS to search for networks that are more suitable for private training. As the previous NAS methods are proposed for non-privacy-preserving tasks, their formulations are not ideal for achieving our goal. We introduce our reformulation in Section 3.1. Based on this formulation, the design of our DPNAS framework includes three parts: the search space, the DP-aware training method for candidate networks, and the search algorithm. In the remainder of this section, we present our novel search space in Section 3.2, describe the proposed DP-aware training method in Section 3.3, and introduce the search algorithm in Section 3.4.
3.1 DPNAS Formulation
NAS is formulated as follows. The architecture search space $\mathcal{A}$ is a set of candidate architectures that can be denoted by directed acyclic graphs (DAGs). For a specific architecture $a \in \mathcal{A}$, the corresponding network can be denoted by $N(a, w_a)$, where $w_a$ represents the network weights of the candidate architecture $a$. The search for the optimal architecture in NAS can be formulated as a bilevel optimization:
$$\max_{a \in \mathcal{A}} \; R\big(N(a, w_a^{*}), D_{val}\big) \quad \text{s.t.} \quad w_a^{*} = F\big(N(a, w_a), D_{train}\big) \qquad (1)$$
where $R(N(a, w_a^{*}), D_{val})$ is the reward of network $N(a, w_a^{*})$ on validation data $D_{val}$, and $F$ is an optimization algorithm that finds the optimal weights $w_a^{*}$ on training data $D_{train}$ given candidate architecture $a$. For traditional architecture search without private training, $F$ is typically a non-private learning algorithm, e.g., SGD. For private learning, however, the resulting architecture will be trained by a differentially private learning algorithm such as DPSGD, so the architectures produced by the above formulation are not suitable. Instead, to account for the private training process, we replace $F$ in Equation (1) with a differentially private optimization algorithm $F_{dp}$, which yields the DPNAS formulation:
$$\max_{a \in \mathcal{A}} \; R\big(N(a, w_a^{*}), D_{val}\big) \quad \text{s.t.} \quad w_a^{*} = F_{dp}\big(N(a, w_a), D_{train}\big) \qquad (2)$$
3.2 Search Space Design
We design a novel search space specifically for DPDL. Following [nasnet, darts, enas], we search for a computation cell as the basic building unit from which the whole network architecture is constructed. As shown in the leftmost part of Figure 2, the overall structure of the network is chained. After an input convolution layer, cells are stacked as main computation blocks. All the cells share the same architecture. The resolution of the input features and output features of a cell is the same. After each cell, a 3×3 max pooling layer with stride 2 is stacked as a downsampling layer. The stacked cells together with a downsampling layer are referred to as a block. We stack three such blocks and end with a classification layer.
We aim at searching the internal connections of the cell. As shown in the middle part of Figure 2, a cell is a fully connected directed acyclic graph consisting of an ordered sequence of nodes. Each node (denoted as a blue circle) corresponds to an intermediate feature map . The input node is obtained by applying a convolution with filter size to the inputs, which adapts the dimension of the inputs to the channel size of the convolution filters in this cell. The nodes are internal nodes, each of which is connected to all the previous nodes in this cell. Each directed edge is associated with some operation chosen from a predefined operation pool containing candidate operations. Each internal node is computed based on all of its predecessors: . The output of the cell is obtained by concatenating all the internal nodes.
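The cell computation described above can be illustrated with a minimal sketch. It is not the authors' implementation: feature maps are reduced to flat lists of floats, the three operations are hypothetical stand-ins for the real operation pool, and combining predecessor outputs by element-wise sum is an assumption borrowed from common cell-based NAS practice, since the exact aggregation formula is not recoverable from the text here. Only the concatenation of internal nodes at the output is taken directly from the description.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]

# Hypothetical stand-ins for entries of the operation pool; the real pool
# holds convolution blocks and pooling layers acting on feature maps.
OPS: Dict[str, Callable[[Vector], Vector]] = {
    "identity": lambda x: list(x),
    "zero": lambda x: [0.0] * len(x),  # encodes "no connection" between two nodes
    "relu": lambda x: [max(0.0, v) for v in x],
}


def cell_forward(
    x0: Vector, edge_ops: Dict[Tuple[int, int], str], num_internal: int
) -> Vector:
    """Compute the cell output from input node 0 and the per-edge operations.

    edge_ops[(i, j)] names the operation on the directed edge i -> j (i < j).
    """
    nodes: List[Vector] = [x0]
    for j in range(1, num_internal + 1):
        # ASSUMPTION: predecessor contributions are combined by element-wise sum.
        acc = [0.0] * len(x0)
        for i in range(j):
            out = OPS[edge_ops[(i, j)]](nodes[i])
            acc = [a + b for a, b in zip(acc, out)]
        nodes.append(acc)
    # The cell output concatenates all internal nodes (the input node is excluded).
    return [v for node in nodes[1:] for v in node]


# Two internal nodes: node 2 ignores the input node and applies ReLU to node 1.
example_ops = {(0, 1): "identity", (0, 2): "zero", (1, 2): "relu"}
cell_output = cell_forward([1.0, -2.0], example_ops, num_internal=2)
```

In this toy example the Zero operation removes the edge from the input node to the second internal node, which is how a densely connected cell can still express sparser topologies.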
As shown in the rightmost part of Figure 2, our operation pool involves different types of activation functions, which distinguishes it from the existing NAS methods. As mentioned in [ts], the choice of activation functions is crucial for improving the model utility for DPDL. Therefore, our search space includes 5 different types of nonlinear activation functions, namely ReLU, Tanh, Sigmoid, Hardtanh, and SELU [selu], as well as identity as a linear activation function. The activation functions are integrated into Conv3×3-Normalization-Activation blocks. We use Group Normalization (GN) [gn] as the normalization method because the widely used Batch Normalization [bn] could cost privacy budget while GN does not [scatter]. We also take the topology of cells into consideration, which is not considered in [ts]. We include 4 non-parametric operations in our search space: Identity, 3×3 max pooling, 3×3 average pooling, and Zero. These operations enable more possible cell topologies. For example, DPNAS can use the Zero operation to indicate a lack of connection between two nodes in the cell.
3.3 DP-aware Training
A differentially private learning algorithm executes a different procedure from its non-private counterpart. Taking this difference into account, we need to choose a private training algorithm as the optimization algorithm in the DPNAS formulation (Section 3.1) for training the sampled candidate networks, so that the resulting networks are better adapted to it. In our implementation, we use DPSGD. DPSGD makes two changes to every iteration of SGD before updating model weights with the computed gradients. It first bounds the sensitivity of the learning process to each individual training example by computing per-example gradients of the training loss with respect to the model parameters, and clipping each per-example gradient to a maximum fixed norm $C$. DPSGD then adds Gaussian noise $\mathcal{N}(0, \sigma^2 C^2 I)$ to the average of these per-example gradients, where $\sigma$ is the noise intensity selected according to the privacy budget [dpsgd]. Previous works [ts, gep] indicate that both steps can negatively affect the learning process, thereby degrading the utility of the resulting models.
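The two DPSGD modifications described above (per-example clipping, then Gaussian noise) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the paper: gradients are flat lists of floats, the function names are hypothetical, and, following the usual DPSGD recipe, the noise is added to the sum of clipped gradients before dividing by the batch size.

```python
import math
import random
from typing import List, Optional


def clip_gradient(grad: List[float], clip_norm: float) -> List[float]:
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dpsgd_noisy_gradient(
    per_example_grads: List[List[float]],
    clip_norm: float,
    noise_multiplier: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the clipped, noised, averaged gradient for one DPSGD step."""
    if rng is None:
        rng = random.Random()
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    batch_size = len(clipped)
    dim = len(clipped[0])
    # The noise standard deviation grows linearly with the clipping norm.
    std = noise_multiplier * clip_norm
    noisy_sum = [
        sum(g[d] for g in clipped) + rng.gauss(0.0, std) for d in range(dim)
    ]
    return [s / batch_size for s in noisy_sum]


# With the noise switched off, the result is just the mean of clipped gradients.
example_grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dpsgd_noisy_gradient(example_grads, 1.0, 0.0, random.Random(0))
```

The first example gradient has norm 5 and is scaled down to norm 1, while the second (norm 0.5) passes through unchanged, which shows how clipping distorts large gradients only.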
To make the search process aware of the impact of those two steps, we include them in the training of sampled architectures and carefully select the values of the clipping norm $C$ and the noise intensity $\sigma$ for the search. In practice, we tend to set $C$ to be small, as the intensity of the added noise scales linearly with $C$. To make the resulting architectures better adapted to gradient clipping, in the search process we set $C$ small, e.g. . As for $\sigma$, we can choose it without considering privacy cost during the search because the resulting architectures will be trained on private datasets from scratch. In practice, we empirically find that setting $\sigma$ to large values can severely slow down the convergence of sampled architectures, making the search process unreliable. Therefore, in our implementation, we set $\sigma$ to be relatively small.
3.4 Search Strategy
We use an RL-based search strategy [nas, nasnet, enas] with parameter sharing [enas]. Given the number of internal nodes , at each step of the search, the RNN controller samples an architecture denoted as an operation sequence . This architecture is trained on a batch of training data using the training method described in Section 3.3. After repeating this sampling and training step for an epoch on the training set, the RNN controller is trained with policy gradient on the validation set, aiming to select architectures that maximize the expected reward. The reward is defined as the validation accuracy of a network plus the weighted controller entropy. The search process alternates between these two phases until the RNN controller converges. The overall search process is depicted in Alg. 1.
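The controller update described above can be sketched as follows. This is a deliberately simplified illustration, not the method's actual controller: the RNN is replaced by independent softmax logits per edge, the operation names are illustrative labels, the edge count assumes every internal node connects to the input node and all earlier internal nodes, `val_accuracy_fn` is a hypothetical placeholder for evaluating a sampled architecture with shared weights on validation data, and the moving-average baseline is a common variance-reduction choice that the text does not specify. The entropy weight 0.05 is the value reported in Section 4.1.

```python
import math
import random
from typing import Callable, List

# Illustrative labels for the operation pool described in Section 3.2.
OP_POOL = [
    "conv_relu", "conv_tanh", "conv_sigmoid", "conv_hardtanh", "conv_selu",
    "conv_identity", "identity", "max_pool", "avg_pool", "zero",
]
ENTROPY_WEIGHT = 0.05  # trade-off weight for controller entropy (Section 4.1)


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def num_edges(num_internal: int) -> int:
    # ASSUMPTION: internal node j has j predecessors (input node + earlier nodes).
    return num_internal * (num_internal + 1) // 2


def sample_architecture(logits: List[List[float]], rng: random.Random) -> List[int]:
    """Sample one operation index per edge from the current policy."""
    return [
        rng.choices(range(len(edge)), weights=softmax(edge))[0] for edge in logits
    ]


def reinforce_update(
    logits: List[List[float]], arch: List[int], advantage: float, lr: float
) -> None:
    """One policy-gradient step: raise log-prob of sampled ops if advantage > 0."""
    for edge, chosen in zip(logits, arch):
        probs = softmax(edge)
        for k in range(len(edge)):
            grad_log_prob = (1.0 if k == chosen else 0.0) - probs[k]
            edge[k] += lr * advantage * grad_log_prob


def run_search(
    num_internal: int,
    steps: int,
    val_accuracy_fn: Callable[[List[int]], float],
    lr: float = 0.1,
    seed: int = 0,
) -> List[List[float]]:
    rng = random.Random(seed)
    logits = [[0.0] * len(OP_POOL) for _ in range(num_edges(num_internal))]
    baseline = 0.0
    for _ in range(steps):
        arch = sample_architecture(logits, rng)
        bonus = ENTROPY_WEIGHT * sum(entropy(softmax(e)) for e in logits)
        reward = val_accuracy_fn(arch) + bonus
        reinforce_update(logits, arch, reward - baseline, lr)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return logits
```

In the full procedure, each controller phase would alternate with an epoch of DP-aware training of the shared weights, which this sketch leaves out.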
4 Experiments
4.1 Experimental Settings
We run DPNAS on MNIST [mnist], FashionMNIST [fmnist], and CIFAR10 [cifar]. We split the training data of each dataset into a training set and a validation set with the ratio of . For the overall architecture, the number of stacked NAS cells in each stage is , the number of internal nodes is , and the internal channels of the three stages are 48, 96, 192 for CIFAR10 and 32, 32, 64 for MNIST and FashionMNIST. The sampled architectures are trained with DPSGD with weight decay , and momentum 0.9. The batch size is set to 300 and the learning rate is set to . The RNN controller used in our search process is the same as the one used in [enas]. It is trained with the Adam optimizer [adam]. The batch size is set to 64 and the learning rate is set to . The tradeoff weight for controller entropy in the reward is set to 0.05. Our search process runs for 100 epochs. The first 25 epochs are warmup epochs that only train sampled architectures without updating the RNN controller.
We evaluate the utility of searched models for DPDL on three common benchmarks: MNIST [mnist], FashionMNIST [fmnist], and CIFAR10 [cifar]. On each dataset, the evaluated models are constructed from the cells searched on that dataset. The evaluated models are trained on the training set from scratch using DPSGD and tested on the test set. The privacy cost of training is computed using the Rényi DP analysis of the Gaussian mechanism [rdp, srdp]. We implement the search process and private training in PyTorch [torch] with the Opacus package. As for model architectures, the number of stacked NAS cells in each stage is , the number of internal nodes is , and the internal channels of the three stages are set to 48, 96, 192 for CIFAR10 and 32, 32, 64 for MNIST and FashionMNIST. All experiments are conducted on an NVIDIA Titan RTX GPU with 24GB of RAM.
Datasets  Models  

CNNTanh[ts]  DPNASNetS  DPNASNet  
MNIST  Acc(),  
Acc(),  
Acc(),  
Params  M  M  M  
FashionMNIST  Acc(),  
Acc(),  
Acc(),  
Params  M  M  M  
CIFAR10  Acc(),  
Acc(),  
Acc(),  
Params  M  M 
Models  Params  Acc(%) under budget  

NASNet [nasnet]  M  
NASNetS [nasnet]  M  
AmoebaNet [real19]  M  
AmoebaNetS [real19]  M  
DARTS [darts]  M  
DARTSS [darts]  M  
EfficientNet [efnet]  M  
EfficientNetS [efnet]  M  
DPNASNet  M 
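The Rényi DP accounting mentioned in the experimental settings can be illustrated with a simplified calculation. The sketch below uses two standard facts: a Gaussian mechanism with unit L2 sensitivity and noise standard deviation σ satisfies (α, α/(2σ²))-RDP, and an (α, ρ)-RDP guarantee implies (ρ + log(1/δ)/(α − 1), δ)-DP. It is an assumption-laden simplification: it composes full-batch steps and ignores privacy amplification by subsampling, so it overestimates the budget compared with the accountant used in practice. The function names and the grid of orders are hypothetical.

```python
import math
from typing import Iterable, List

# Hypothetical grid of Renyi orders to optimize over.
DEFAULT_ALPHAS: List[float] = [1.0 + x / 10.0 for x in range(1, 100)] + [
    float(a) for a in range(12, 64)
]


def gaussian_rdp(alpha: float, noise_multiplier: float) -> float:
    """RDP of order alpha for one Gaussian mechanism with unit sensitivity."""
    return alpha / (2.0 * noise_multiplier ** 2)


def rdp_to_dp(rdp_eps: float, alpha: float, delta: float) -> float:
    """Convert an (alpha, rdp_eps)-RDP guarantee to (eps, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)


def epsilon_after_steps(
    noise_multiplier: float,
    steps: int,
    delta: float,
    alphas: Iterable[float] = DEFAULT_ALPHAS,
) -> float:
    """Compose `steps` Gaussian mechanisms (RDP adds up) and pick the best order.

    ASSUMPTION: no subsampling amplification, so this is a loose upper bound.
    """
    return min(
        rdp_to_dp(steps * gaussian_rdp(a, noise_multiplier), a, delta)
        for a in alphas
    )


eps_low_noise = epsilon_after_steps(noise_multiplier=1.0, steps=100, delta=1e-5)
eps_high_noise = epsilon_after_steps(noise_multiplier=2.0, steps=100, delta=1e-5)
```

Comparing the two values shows the expected tradeoff: a larger noise multiplier yields a smaller privacy budget for the same number of steps.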
4.2 Main Results
In Table 1, we compare the searched model DPNASNet with the recent SOTA model CNNTanh from [ts], which is a CNN model with Tanh as the activation function. Comparisons are conducted on MNIST, FashionMNIST, and CIFAR10 for different privacy budgets of with fixed privacy parameter . On each dataset, DPNASNet is built by stacking the cell searched on that dataset. For a fair comparison, we also construct a smaller version of DPNASNet by reducing the channel numbers of the convolution layers, denoted as DPNASNetS, which has a number of parameters comparable to CNNTanh. We train each model with DPSGD from scratch 10 times and report the mean and standard deviation of the test accuracy over the 10 runs. Our DPNASNet achieves new SOTA results. On MNIST, DPNASNet achieves for the privacy budget of , whereas the previous SOTA reported in [ts] is for the privacy budget of . On CIFAR10, we match the best accuracy in [ts], namely for the privacy budget of , with a much smaller budget of , which is an improvement in the DP guarantee of . We also train DPNASNet for this larger DP budget and obtain for the privacy budget of .
Search Dataset  Evaluation Dataset (Acc, )  

MNIST  FashionMNIST  CIFAR10  
MNIST  
FashionMNIST  
CIFAR10 
In Table 2, we compare DPNASNet with models searched by previous NAS methods that do not consider private learning, e.g., DARTS [darts], NASNet [nasnet], AmoebaNet [real19], and EfficientNet [efnet]. We also compare DPNASNet with small versions of these models that have a number of parameters comparable to DPNASNet. The small models are constructed by reducing the channel numbers of the convolution layers in the original models. We train these models on CIFAR10 with DPSGD for different privacy budgets of with . From Table 2, we observe that DPNASNet is substantially superior to the existing models obtained by NAS methods.
4.3 Ablation Studies
We present ablation studies to verify the effectiveness of our search space, training method, and search strategy. More experimental results are provided in the Appendix.
Effect of search space.
To verify the effectiveness of the search space, we replace our search space with two search spaces widely used in previous NAS works: the NASNet search space [nasnet] and the DARTS search space [darts]. In NASNet, the predefined cell structure is different from ours. The cells are divided into two groups, namely Normal Cell and Reduction Cell. In a cell, each internal node has two inputs selected from the outputs of previous nodes or of the previous two layers. In DARTS, the cells are also divided into two groups as in NASNet, but the predefined cell structure is the same as ours. The candidate operations in both search spaces differ considerably from ours, as they do not involve different types of activation functions. More details about these two search spaces can be found in [nasnet] and [darts]. We apply our search process to these two spaces and compare the resulting models with DPNASNet. As shown in Figure 3(a), DPNASNet is superior to the models searched on these two spaces, which indicates that our search space is more effective.
Effect of training method.
To verify the effectiveness of our training method for sampled architectures during the search process, we run our search process with our training method replaced by ordinary SGD. We then use the resulting controller to generate 20 architectures and train each of them with DPSGD 10 times on CIFAR10 for the privacy budget of . The results are presented in Figure 3(b). Each blue box represents the results of the 10 runs of one architecture, and the blue dotted line is the average accuracy of the 20 architectures. We can see that without our training method, the resulting architectures still perform better than handcrafted models, e.g. test accuracy of CNNTanh [ts], but all the sampled models are less effective than DPNASNet searched with our training method.
Effect of search strategy.
To show the effectiveness of the search algorithm, we replace the RL-based search method in our search process with random search and plot the test accuracy of 10 sampled architectures after each epoch in Figure 3(c). We find that, after warmup (the first 25 epochs), the test accuracy of models sampled by the RL controller becomes increasingly higher than that of randomly sampled models, which means the search algorithm of DPNAS is effective.
Transferability of searched architectures.
The main results for MNIST, FashionMNIST, and CIFAR10 are obtained by networks searched on these three datasets, respectively. To show the transferability of searched architectures, for each dataset, we construct a model with the cell searched on this dataset and evaluate it on the other two datasets. As shown in Table 3, a cell searched on one dataset also performs well when evaluated on the other datasets, achieving accuracy comparable to that of the cells searched on the evaluation dataset. The transferability of NAS-searched architectures has been verified in [zoph2018learning]. Our experiments validate that this transferability still holds when private training is taken into consideration in the search process.
5 Architecture Analysis and Discussions
Our searched networks for DPDL show meaningful patterns that are distinct from those of networks searched for non-private image classification. Figure 4 shows the cell architecture of our searched DPNASNet. We conduct qualitative and quantitative analyses of the architectures searched by DPNAS and summarize several observations for designing private-learning-friendly networks as follows.
SELU outperforms Tanh for DPDL.
From Figure 4, we observe that SELU is the most frequently used activation function in the resulting architectures. To statistically analyze the occurrence frequency of each activation function, we use the trained controller to sample 1000 architectures and count the occurrences of each activation function in these architectures. As shown in Figure 5(a), SELU occurs much more frequently than the others. Tanh also appears frequently, and its effectiveness for DPDL has been verified in previous work [ts]. As SELU appears more frequently than Tanh in our results, we ask whether SELU is better than Tanh for constructing networks for DPDL. To answer this question, we take the simple CNN model from [ts] and build two models, CNNSELU and CNNTanh, with SELU and Tanh as the activation function, respectively. From Figure 5(c), we find that CNNSELU consistently outperforms CNNTanh. Papernot et al. [ts] argue that Tanh outperforms ReLU for DPDL because Tanh is bounded while ReLU is unbounded. However, we find that SELU is nevertheless effective for DPDL, although it is also unbounded. Our intuitive explanation is that activation functions that can retain negative values of their inputs could be more suitable for DPDL. We try to verify this intuition experimentally. From Table 4, we observe that the activation functions that can retain negative values outperform those whose outputs are always non-negative. This result is consistent with our intuition. We leave the theoretical explanation for future studies.
Function  Has Negative  Bounded  Test Acc 

ReLU  
ReLU6  
ELU  
SELU  
Tanh  
HardTanh  
LeakyReLU 
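The two structural properties compared in Table 4 (whether an activation can output negative values, and whether it is bounded) can be checked numerically with a small sketch. The definitions below are the standard scalar forms of each function; the LeakyReLU slope of 0.01 and the ELU scale of 1 are assumed defaults, and the probing thresholds are arbitrary illustrative choices. The sketch says nothing about test accuracy.

```python
import math
from typing import Callable, Dict, Tuple

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

ACTIVATIONS: Dict[str, Callable[[float], float]] = {
    "ReLU": lambda x: max(0.0, x),
    "ReLU6": lambda x: min(max(0.0, x), 6.0),
    "ELU": lambda x: x if x > 0 else math.exp(x) - 1.0,
    "SELU": lambda x: SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0)),
    "Tanh": math.tanh,
    "HardTanh": lambda x: max(-1.0, min(1.0, x)),
    "LeakyReLU": lambda x: x if x > 0 else 0.01 * x,  # assumed slope 0.01
}

GRID = [i / 10.0 for i in range(-100, 101)]  # inputs from -10 to 10


def has_negative_outputs(fn: Callable[[float], float]) -> bool:
    """True if the activation outputs a negative value somewhere on the grid."""
    return any(fn(x) < 0.0 for x in GRID)


def is_bounded(fn: Callable[[float], float], probe: float = 1e6, limit: float = 1e3) -> bool:
    """Heuristic: outputs stay small even for very large positive/negative inputs."""
    return abs(fn(probe)) < limit and abs(fn(-probe)) < limit


properties: Dict[str, Tuple[bool, bool]] = {
    name: (has_negative_outputs(fn), is_bounded(fn))
    for name, fn in ACTIVATIONS.items()
}
```

Under these definitions SELU and ELU retain negative values while being unbounded above, which is the combination the discussion above singles out.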
MaxPool performs better than AvgPool.
From Figure 4, we also observe that MaxPool appears more frequently than AvgPool. We count the occurrence frequency of each pooling function using the same statistical method as for activation functions. From Figure 5(b), we find that the proportion of MaxPool in the resulting architectures is much higher than that of AvgPool. Based on this observation, we ask whether MaxPool is better than AvgPool for the utility of DPSGD-trained models. To answer this, we compare two simple CNN models. One model employs MaxPool for all its pooling operations, and the other uses AvgPool for all pooling layers. Both models are trained with DPSGD on CIFAR10 with the same settings. Figure 5(d) shows that MaxPool is a better choice than AvgPool for DPSGD-trained models.
6 Conclusion
We demonstrate that the model architecture has a significant impact on the utility of DPDL. We then present DPNAS, the first framework for automatically searching models for DPDL. We design a novel search space and propose a DP-aware training method in DPNAS to guide the searched models to adapt to DPSGD training. The searched model DPNASNet consistently advances the SOTA accuracy on different benchmarks. Finally, we analyze the generated architectures and provide several new findings on operation selection for designing private-learning-friendly DNNs.
References
Appendix
Running time of DPNAS search process.
The running times of the search on MNIST with and without DP-aware training are 232 s/epoch and 98 s/epoch, respectively. In general, the DPNAS search is slower than the search without DP-aware training because the per-example gradient computation of DPSGD slows down training. However, the running time of the search relies heavily on the implementation of DPSGD. Some techniques [subramani2020enabling, lee2020scaling] speed up per-example gradient clipping and could be applied to reduce the search time of DPNAS.
Inference time of DPNASNet.
We report the inference time of CNNTanh, DPNASNet, and DPNASNetsmall in Table 5. We observe that, with a comparable number of parameters, our architecture DPNASNetsmall is slightly slower than CNNTanh but achieves better test accuracy. A possible reason is that the found cell architectures contain multi-branch structures, which are not computationally friendly. As we focus on improving the utility of deep learning with DP, the efficiency of the resulting models is not the primary goal of this paper. Nevertheless, incorporating both DP learning and efficiency constraints into NAS, in order to search for models that are both DP-friendly and efficient, could be an interesting direction. We leave this for future work.
Model  Parameter  Inference Time  Accuracy 

CNNTanh  0.03M  0.1577s  
DPNASNetsmall  0.03M  0.2505s  
DPNASNet  0.21M  0.4703s 
Final architecture varies as a function of the activation or pooling layer.
In Table 6 and Table 7, we conduct two ablation studies to evaluate how the performance of the resulting architecture varies as a function of activation functions and pooling layers, respectively. In the first experiment, we take the resulting architecture and replace all of its activation functions with a single type of activation. For example, "only ReLU" indicates that we replace all activation functions in our searched architecture with ReLU. We also evaluate another architecture obtained by replacing the activation function at each node with a randomly sampled one. The experiment on pooling layers is conducted in the same way as that on activation functions. The results indicate that, compared with a search that jointly considers architecture topology and component selection, decoupling the two could lead to suboptimal results.
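The construction of the ablation variants described above can be sketched as a simple transformation of an architecture's operation sequence. The encoding is hypothetical: operations are plain strings, and a convolution block with a given activation is written as `conv_<activation>`; the actual representation used in the experiments is not given in the text.

```python
import random
from typing import List

# Hypothetical names for the six activation choices in the search space.
ACTIVATION_NAMES = ["relu", "tanh", "sigmoid", "hardtanh", "selu", "linear"]


def only_activation(arch: List[str], activation: str) -> List[str]:
    """Replace the activation of every conv block; keep other operations as-is."""
    return [f"conv_{activation}" if op.startswith("conv_") else op for op in arch]


def random_activation(arch: List[str], rng: random.Random) -> List[str]:
    """Replace the activation of every conv block with a randomly sampled one."""
    return [
        f"conv_{rng.choice(ACTIVATION_NAMES)}" if op.startswith("conv_") else op
        for op in arch
    ]


searched = ["conv_selu", "max_pool", "conv_tanh", "zero"]
only_relu = only_activation(searched, "relu")
random_act = random_activation(searched, random.Random(0))
```

Both variants keep the cell topology fixed and change only the component choice, which is the decoupling the ablation is meant to probe.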
Privacy leakage risk of the search process.
As the architecture search process directly uses the private set, a question we should discuss is whether the architecture search could inadvertently leak private information about the training set. To answer this question, we first conduct a simple sanity check by evaluating the test accuracy of DPNASNet at initialization. We test DPNASNet 10 times on CIFAR10 with different random seeds for initialization, and the average accuracy of the 10 results is . We do not observe that the resulting model achieves accuracy significantly higher than 10% (random guess) at initialization. However, intuitively, the search process on the private set still has the potential to leak private information about the training data. One way to completely avoid this is to conduct the search on a public dataset and apply the resulting model to the private dataset, e.g., search on FashionMNIST and apply to MNIST or CIFAR. As shown in Table 3 of the main paper, the cell architectures searched by DPNAS transfer well across datasets, which means the above solution is feasible.
Model type  Accuracy (%) 

only ReLU  66.36 
only SELU  68.06 
only Tanh  65.31 
only Linear  66.81 
only Hardtanh  64.80 
only Sigmoid  58.09 
random act  66.09 
DPNASNet  68.33 
Model type  Accuracy (%) 

only MaxPool  68.33 
only AvgPool  68.47 
random pooling  66.53 
DPNASNet  68.33 