DNR_ASP_DAC2021
This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet18, on CIFAR-10 and CIFAR-100, as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20x compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives.
In recent years, deep neural networks (DNNs) have emerged as critical components in various applications, including image classification [17], speech recognition [15], medical image analysis [21], and autonomous driving [3]. However, despite the proliferation of deep-learning-powered applications, machine learning models have raised significant security concerns due to their vulnerability to adversarial examples, i.e., maliciously generated images which are perceptually similar to clean ones yet able to fool classifier models into making wrong predictions [2, 8]. Various recent works have proposed associated defense mechanisms, including adversarial training [8], hiding gradients [31], adding noise to the weights [14], and several others [24]. Meanwhile, large model sizes incur high inference latency, computation, and storage costs that represent significant challenges in deployment on IoT devices. Thus, reduced-size models [19, 7] and model compression techniques, e.g., pruning [4, 5, 12], have gained significant traction. In particular, earlier work showed that pruning can remove a large majority of the model parameters without a significant accuracy drop [4, 5], and that ensuring the pruned models have structure can yield observed performance improvements on a broad range of compute platforms [13]. However, adversarial training that increases network robustness generally demands more nonzero parameters than needed for only clean data [23], as illustrated in Fig. 1(a). Thus, a naively compressed model performing well on clean images can become vulnerable to adversarial images. Unfortunately, despite a plethora of work on compressed model performance on clean data, there have been only a few studies on the robustness of compressed models under adversarial attacks.
In particular, some prior works [32, 9] have tried to design a compressed yet robust model through a unified constrained optimization formulation using the alternating direction method of multipliers (ADMM), in which dynamic regularization is the key to outperforming state-of-the-art pruning techniques [27]. However, in ADMM the network designer needs to specify layer-wise sparsity ratios, which requires prior knowledge of an effective compressed model. This knowledge may not be available, and thus training may require multiple iterations to determine good layer-sparsity ratios. Another related work [29] has aimed to use pretrained weights to perform robust pruning and has demonstrated the benefits of fine-tuning after training in terms of increased performance. In other schemes like Lasso [26], a target compression ratio cannot be set because the final compression ratio is not determined until training is completed. Moreover, Lasso requires separate retraining to recover accuracy after non-significant weights are set to zero, resulting in costly training.
In contrast, this paper presents dynamic network rewiring (DNR), a unified training framework to find a compressed model with increased robustness that does not require individual per-layer target sparsity ratios. In particular, we introduce a hybrid loss function for robust compression which has three major components: a clean image classification loss, a dynamic regularizer term inspired by a relaxed version of ADMM [6], and an adversarial training loss. Inspired by the sparse-learning-based training scheme of [4], we then propose a single-shot training framework to achieve a robust pruned DNN using the proposed loss. In particular, DNR dynamically arranges per-layer pruning ratios using normalized momentum, maintaining the target pruning ratio every epoch without requiring any fine-tuning. In summary, our key contributions are:
Given only a global pruning ratio, we propose a single-shot (non-iterative) training framework that simultaneously achieves an ultra-high compression ratio, state-of-the-art accuracy on clean data, and robustness to perturbed images.
We extend the approach to support a structured pruning technique, namely channel pruning, enabling benefits on a broader class of compute platforms. As opposed to conventional sparse learning [4], which can perform only irregular pruning, models generated through structured DNR can significantly speed up inference. To the best of our knowledge, we are the first to propose a non-iterative robust training framework that supports both irregular and channel pruning.
We provide a comprehensive investigation of adversarial robustness for both channel and irregular pruning, and obtain insightful observations through an extensive set of experiments on CIFAR-10 [16], CIFAR-100 [16], and Tiny-ImageNet [10] using variants of ResNet18 [11] and VGG16 [30]. Our proposed method consistently outperforms state-of-the-art (SOTA) [26, 32] approaches with negligible accuracy drop compared to the unpruned baselines.
We further empirically demonstrate the superiority of our scheme when used to target model compression for the clean-only image classification task, compared to SOTA non-iterative pruning mechanisms [4, 5, 12, 20].
Footnote 1: This paper targets low-cost training; thus, comparisons to iterative pruning methods (e.g., [27]) are out of scope.
Recently, various adversarial attacks have been proposed to find fake images, i.e., adversarial examples, which have barely-visible perturbations from real images but still manage to fool a trained DNN. One of the most common attacks is the fast gradient sign method (FGSM) [8].
Given a vectorized input x of the real image and its corresponding label t, FGSM perturbs each element x of x along the sign of the associated element of the gradient of the inference loss w.r.t. x, as shown in Eq. 1 and illustrated in Fig. 1(b):

\hat{x} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_{x}\,\mathcal{L}(f_{\theta}(x), t)\big)    (1)

Another common attack is the projected gradient descent (PGD) [23]. The PGD attack is a multi-step variant of FGSM where \hat{x}^{0} = x, and the iterative update of the perturbed data in step k+1 is given in Eq. 2:

\hat{x}^{k+1} = \mathrm{Proj}_{x+\mathcal{S}}\Big(\hat{x}^{k} + \alpha \cdot \mathrm{sign}\big(\nabla_{\hat{x}^{k}}\,\mathcal{L}(f_{\theta}(\hat{x}^{k}), t)\big)\Big)    (2)

Here, the scalar \epsilon corresponds to the perturbation constraint that determines the severity of the perturbation, and f_{\theta}(\cdot) generates the output of the DNN, parameterized by \theta. Proj projects the updated adversarial sample onto the projection space x + \mathcal{S}, which is the \epsilon-neighbourhood of the benign sample x, and \alpha is the attack step size.
Footnote 2: It is noteworthy that the generated \hat{x} are clipped to a valid pixel range in our experiments.
Note that these two strategies assume the attacker knows the details of the DNN and are thus termed white-box attacks. We will evaluate the merit of our training scheme by measuring the robustness of our trained models to the fake images generated by these attacks. We argue that this evaluation is more comprehensive than using images generated by attacks that assume limited knowledge of the DNN [28]. Moreover, we note that PGD is one of the strongest adversarial example generation algorithms [23] and use it as part of our proposed framework.
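As a concrete illustration of the PGD update in Eq. 2, the following minimal pure-Python sketch applies one gradient-sign ascent step and projects the result back into the epsilon-ball around the clean sample. The function names, the toy gradient values, and the [0, 1] pixel range are illustrative assumptions, not code from the paper:

```python
def sign(v):
    """Sign of a scalar: +1, -1, or 0."""
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def pgd_step(x_adv, x_clean, grad, alpha, eps):
    """One PGD iteration on a vectorized input: step along sign(grad),
    project onto the L-infinity eps-ball around the clean sample, and
    clip to the assumed valid pixel range [0, 1]."""
    out = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        v = xa + alpha * sign(g)              # gradient-sign ascent step
        v = max(xc - eps, min(xc + eps, v))   # project onto the eps-ball
        v = max(0.0, min(1.0, v))             # clip to valid pixel range
        out.append(v)
    return out

# Toy example: a 3-pixel "image" with a made-up loss gradient.
x_clean = [0.2, 0.5, 0.9]
grad = [0.3, -0.7, 0.0]
x_adv = pgd_step(x_clean, x_clean, grad, alpha=0.1, eps=0.03)
```

Because the step size alpha exceeds eps here, each perturbed pixel lands on the boundary of the eps-ball (or stays put where the gradient is zero), which is the typical behavior of later PGD iterations.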
ADMM is a powerful optimization method used to solve problems with non-convex, combinatorial constraints [1]. It decomposes the original optimization problem into two subproblems and solves the subproblems iteratively until convergence. Pruning convolutional neural networks (CNNs) can be modeled as an optimization problem in which the cardinality of each layer's weight tensor is bounded by its pre-specified pruning ratio. In the ADMM framework, such constraints are transformed into ones represented with indicator functions, such as g(z) = 0 for card(z) <= l and g(z) = +infinity otherwise. Here, z denotes the duplicate variable [1] and l represents the target number of nonzero weights determined by pre-specified pruning ratios. Next, the original optimization problem is reformulated as:

\min_{\theta, z}\; \mathcal{L}(\theta) + g(z) + \langle \lambda, \theta - z \rangle + \frac{\rho}{2}\lVert \theta - z \rVert_2^2    (3)

where \lambda is the Lagrangian multiplier and \rho is the penalization factor applied when the parameters \theta and z differ. Eq. (3) is broken into two subproblems which solve for \theta and z iteratively until convergence [27]. The first subproblem uses stochastic gradient descent (SGD) to update \theta, while the second subproblem applies projection to find the assignment of z that is closest to \theta yet satisfies the cardinality constraint, effectively pruning weights with small magnitudes. Not only can ADMM prune a model's weight tensors, but its quadratic penalty term also acts as a dynamic regularizer. Such adaptive regularization is one of the main reasons behind the success of its use in pruning. However, ADMM-based pruning has several drawbacks. First, ADMM requires prior knowledge of the per-layer pruning ratios. Second, ADMM does not guarantee the pruning ratio will be met, and therefore an additional round of hard pruning is required after ADMM completes. Third, not all problems solved with ADMM are guaranteed to converge. Fourth, to improve convergence, \rho needs to be progressively increased across several rounds of training, which increases training time [1].
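The projection subproblem described above has a simple closed form: keep the entries of largest magnitude and zero out the rest. A minimal pure-Python sketch (the function name and flat-list weight representation are illustrative assumptions):

```python
def project_cardinality(weights, k):
    """ADMM second subproblem (sketch): find the duplicate variable z
    closest in Euclidean distance to the weights subject to card(z) <= k,
    i.e. keep the k largest-magnitude entries and zero the rest."""
    if k >= len(weights):
        return list(weights)
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(len(weights)),
                 key=lambda i: abs(weights[i]), reverse=True)[:k]
    keep = set(top)
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]
```

For example, projecting [0.5, -2.0, 0.1, 1.5] onto a cardinality of 2 keeps -2.0 and 1.5 and zeroes the small-magnitude entries, which is exactly the "prune weights with small magnitudes" effect noted in the text.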
Sparse learning [4] addresses some shortcomings of ADMM by leveraging exponentially smoothed gradients (momentum) to prune weights. It redistributes pruned weights across layers according to their mean momentum contribution. The weights that will be removed and transferred to other layers are chosen according to their magnitudes, while the weights that are brought back (reactivated) are selected based on their momentum values. On the other hand, a major shortcoming of sparse learning compared to ADMM is that it does not benefit from a dynamic regularizer and thus often yields lower accuracy. Furthermore, existing sparse-learning schemes support only irregular forms of pruning, limiting speedup on many compute platforms. Finally, sparse learning, to the best of our knowledge, has not previously been extended to robust model compression.
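The redistribution idea can be sketched as a proportional allocation of a regrowth budget by mean absolute momentum per layer. This is a simplification of the scheme in [4], with illustrative names and a naive rounding rule:

```python
def redistribute(momentum_by_layer, total_regrow):
    """Sketch of sparse learning's redistribution step: each layer receives
    a share of the regrowth budget proportional to its mean absolute
    momentum, so layers with more 'active' gradients regrow more weights."""
    means = [sum(abs(m) for m in layer) / len(layer)
             for layer in momentum_by_layer]
    total = sum(means)
    return [int(round(total_regrow * mu / total)) for mu in means]
```

With two layers whose mean momenta are 1.0 and 3.0 and a budget of 4 weights, the second layer receives three regrown weights and the first receives one.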
To tackle the shortcomings of ADMM and sparse learning, this section introduces a dynamic regularizer that enables non-iterative training to achieve high accuracy with compressed models. We then describe a hybrid loss function to provide robustness to the compressed models and an extension to support structured pruning.
For a DNN parameterized by \theta with L layers, we let W_l represent the weight tensor of layer l. In our sparse-learning approach, these weight tensors are element-wise multiplied (\odot) by corresponding binary mask tensors M_l to retain only a fraction of nonzero weights, thereby meeting a target pruning ratio. We update each layer's mask in every epoch, similar to [4]. The number of nonzeros is updated based on the layer's normalized momentum, and the specific nonzero entries are set to favor large-magnitude weights. We incorporate an ADMM dynamic regularizer [27] into this framework by introducing a duplicate variable Z_l for the nonzero weights, which is in turn updated at the start of every epoch. Unlike [27], we only penalize differences between the masked weights W_l \odot M_l of a layer and their corresponding duplicate variable Z_l. Because the total cardinality constraint of the masked parameters is satisfied by construction, the indicator penalty g(\cdot) is redundant and the loss function may be simplified as

\mathcal{L}_{DNR} = \mathcal{L}(f_{\theta \odot M}(x), t) + \frac{\rho}{2} \sum_{l=1}^{L} \lVert W_l \odot M_l - Z_l \rVert_2^2    (4)

where \rho is the dynamic penalizing factor. This simplification is particularly important because the indicator function used in Eq. 3 is non-differentiable, and its removal in Eq. 4 enables the loss function to be minimized without decomposition into two subproblems.
Footnote 3: Note this simplified loss function also drops the Lagrangian term because Z_l is updated with W_l \odot M_l at the beginning of each epoch, forcing the Lagrangian multiplier and its contribution to the loss function to always be 0. Moreover, SGD with this loss function converges similarly to SGD with the original loss and more reliably than ADMM.
Intuitively, the key role of the dynamic regularizer in this simplified loss function is to encourage the DNN not to change the values of weights that have large magnitude unless the corresponding loss is large, similar to what the dynamic regularizer does in ADMM-based pruning.
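The regularizer term of Eq. 4 is a scaled squared distance between the masked weights and the duplicate variable. A pure-Python sketch for a single layer, with illustrative names and flat-list tensors:

```python
def dynamic_reg(weights, mask, z, rho):
    """Sketch of the dynamic regularizer in Eq. 4 for one layer:
    (rho / 2) * || W (*) M - Z ||_2^2, penalizing only the masked
    (retained) weights for drifting from the duplicate variable Z
    that is refreshed at the start of each epoch."""
    sq = sum((w * m - zi) ** 2 for w, m, zi in zip(weights, mask, z))
    return 0.5 * rho * sq
```

Because Z is reset to the masked weights every epoch, this penalty starts each epoch at zero and grows only for retained weights that move far from their epoch-start values, which is the "don't change large weights without a good reason" behavior described above.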
For a given input image x, adversarial training can be viewed as a min-max optimization problem that finds the model parameters \theta minimizing the loss associated with the corresponding adversarial sample \hat{x}, as shown below:

\min_{\theta}\; \max_{\hat{x} \in x + \mathcal{S}}\; \mathcal{L}(f_{\theta}(\hat{x}), t)    (5)
In our framework, we use SGD for loss minimization and PGD to generate adversarial images. More specifically, to boost classification robustness on perturbed data, we propose using a hybrid loss function that combines the proposed simplified loss function in Eq. 4 with an adversarial image loss, i.e.,

\mathcal{L}_{hybrid} = \beta\, \mathcal{L}_{DNR} + (1 - \beta)\, \mathcal{L}(f_{\theta \odot M}(\hat{x}), t)    (6)

where \beta provides a tunable trade-off between the two loss components.
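The hybrid objective of Eq. 6 reduces to a convex combination of the two loss terms; a minimal sketch (the function name and the symbol beta are our notation):

```python
def hybrid_loss(clean_loss, adv_loss, beta):
    """Sketch of Eq. 6: convex combination of the regularized clean-image
    loss and the adversarial-image loss; beta trades clean accuracy
    against robustness."""
    return beta * clean_loss + (1.0 - beta) * adv_loss
```

At the extremes, beta = 1 recovers a clean-only objective (as used for the clean-image pruning experiments later in the paper), while beta = 0 trains purely on adversarial samples.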
Observation 1: A DNN with only a fraction of its weights active throughout training can be trained with the proposed hybrid loss to finally converge similarly to the unpruned model (mask of all ones), providing a robust yet compressed model.
This is exemplified in Fig. 2(a), which shows similar convergence trends for both pruned and unpruned models, simultaneously achieving both the target compression and robustness while also avoiding the need for multiple training iterations.
Let the weight tensor of a convolutional layer be denoted as W_l, with kernel height k_h, kernel width k_w, F filters, and C channels per filter. We convert this tensor to a 2D weight matrix with F rows and (k_h * k_w * C) columns. We then partition this matrix into C submatrices of F rows and (k_h * k_w) columns, one for each channel. To compute the importance of a channel c, we find the Frobenius norm (F-norm) of the corresponding submatrix, thus effectively computing I_c = ||W_l^(c)||_F. Based on the fraction of nonzero weights that need to be rewired during an epoch, denoted by the pruning rate p, we compute the number of channels that must be pruned from each layer l and prune the channels with the lowest F-norms. We then compute each layer's importance based on the normalized momentum contributed by its nonzero channels. These importance measures are used to determine the number of zero-F-norm channels that should be regrown for each layer l. More precisely, we regrow the zero-F-norm channels with the highest Frobenius norms of their momentum. We note that this approach can easily be extended to enable various other forms of structured pruning. Moreover, although the framework supports pruning of both convolutional and linear layers, this paper focuses on reducing the computational complexity of a DNN. We thus experiment with pruning only convolutional layers, because they dominate the computational complexity [18]. The detailed pseudo-code of the proposed training framework is shown in Algorithm 1.
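The per-channel Frobenius-norm importance computation can be sketched as follows, assuming a flattened [filter][channel][kernel] weight layout (the function names and layout are illustrative assumptions, not the paper's implementation):

```python
import math

def channel_fnorms(weight, num_filters, num_channels, kh, kw):
    """Compute each channel's importance as the Frobenius norm of its
    submatrix, given a conv weight flattened in [filter][channel][kh*kw]
    order (an assumed layout for this sketch)."""
    per_channel = kh * kw
    norms = []
    for c in range(num_channels):
        sq = 0.0
        for f in range(num_filters):
            base = (f * num_channels + c) * per_channel
            for k in range(per_channel):
                sq += weight[base + k] ** 2
        norms.append(math.sqrt(sq))
    return norms

def prune_lowest_channels(norms, n_prune):
    """Indices of the n_prune channels with the smallest F-norms,
    i.e. the channels selected for pruning."""
    return sorted(sorted(range(len(norms)), key=lambda c: norms[c])[:n_prune])
```

For a toy layer with 1 filter, 2 channels, and a 1x2 kernel, the channel holding weights [3, 4] has F-norm 5 while the channel holding [0, 1] has F-norm 1, so the latter would be pruned first.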
It is noteworthy that DNR's ability to arrange per-layer pruning ratios for robust compression avoids the tedious task of hand-tuning the pruning ratio based on layer sensitivity. To illustrate this, we follow [5] to quantify the sensitivity of a layer by measuring the percentage reduction in classification accuracy, on both clean and adversarial images, caused by pruning that layer to a given ratio while leaving the other layers unpruned.
Observation 2: DNN layers' sensitivities towards clean and perturbed images are not necessarily equal; thus, determining layer pruning ratios for robust models is particularly challenging.
As exemplified in Fig. 2(b), for a fixed per-layer pruning ratio there is a significant difference in the sensitivity of the layers for clean and perturbed image classification. DNR, on the contrary, automatically finds per-layer pruning ratios (overlaid as pruning sensitivity as in [5]) that serve well for both types of image classification while targeting a given global pruning ratio.
Here, we first describe the experimental setup used to evaluate the effectiveness of the proposed robust training scheme. We then compare our method against other state-of-the-art robust pruning techniques based on ADMM [32] and Lasso [26]. We also evaluate the merit of DNR as a clean-image pruning scheme and show that it consistently outperforms contemporary non-iterative model pruning techniques [20, 4, 5, 12]. We finally present an ablation study to empirically evaluate the importance of the dynamic regularizer in DNR's loss function. We used PyTorch [25] to implement the models and trained/tested on AWS P3.2xlarge instances with an NVIDIA Tesla V100 GPU. We selected three widely used datasets, CIFAR-10 [16], CIFAR-100 [16], and Tiny-ImageNet [10], and picked two well-known CNN models, VGG16 [30] and ResNet18 [11]. Both CIFAR-10 and CIFAR-100 have 50K training samples and 10K test samples with an input image size of 32x32. The training and test data sizes for Tiny-ImageNet are 100K and 10K, respectively, where each image is of size 64x64. For all the datasets we used standard data augmentations (horizontal flip and random crop with reflective padding) to train the models with a batch size of 128.
For PGD, we set the perturbation bound \epsilon, the attack step size \alpha, and the number of attack iterations to the same values as in [14]. For FGSM, we chose the same \epsilon value as above.
We performed DNR-based training for 200/170/60 epochs for CIFAR-10/CIFAR-100/Tiny-ImageNet, with standard starting learning rate, momentum, and weight decay values. For CIFAR-10 and CIFAR-100, the learning rate (LR) was reduced at three epoch milestones; for Tiny-ImageNet we reduced the LR at two milestones. In addition, we hand-tuned \rho and set an initial pruning rate p, which we linearly decreased every epoch. Finally, to balance the clean and adversarial losses, we tuned \beta. Lastly, note that we performed warm-up sparse learning [4] for the first 5 epochs with only the clean image loss function, before using the hybrid loss function with dynamic regularization (see Eq. 6) for robust compression for the remaining epochs.
Results on CIFAR datasets: We analyzed the impact of our robust training framework on both clean and adversarially generated images at various target compression ratios, where model compression is computed as the ratio of the total number of weights in the model to the number of nonzero weights in the pruned model. As shown in Figs. 3(a-b), DNR can effectively find a robust model with high compression and negligible compromise in accuracy. In particular, for irregular pruning our method can compress the model significantly with negligible drop in accuracy on clean as well as PGD- and FGSM-perturbed images, compared to the baseline non-pruned models, tested with VGG16 on CIFAR-10 and ResNet18 on CIFAR-100.
Footnote 4: A similar trend is observed for VGG16 on CIFAR-100 and ResNet18 on CIFAR-10. These results are not included in the paper due to space limitations.
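The compression metric used above follows directly from its definition; a one-function sketch (the function name is illustrative):

```python
def compression_ratio(weights):
    """Model compression as defined in the text: total number of weights
    divided by the number of non-zero weights in the pruned model."""
    nonzeros = sum(1 for w in weights if w != 0)
    return len(weights) / nonzeros if nonzeros else float("inf")
```

For instance, a model in which only one weight in four survives pruning has a compression ratio of 4x.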
Observation 3: As the target compression ratio increases, channel pruning degrades adversarial robustness more significantly than irregular pruning.
As we can see in Figs. 3(a-b), the achievable model compression with negligible accuracy loss for structured (channel-pruned) models is lower than that achievable through irregular pruning. This trend matches the models' performance on clean images. However, as we can see in Fig. 3(c), the percentage of channels present in our channel-pruned models can be substantially lower than in their irregular counterparts, implying a similarly large speedup in inference time on a large range of compute platforms [4].
Results on Tiny-ImageNet: As shown in Table 1, DNR can compress the model considerably without any compromise in performance for both clean and perturbed image classification.
It is also noteworthy that all our accuracy results for both clean and adversarial images correspond to models that provide the best test accuracy on clean images. This is because robustness gains are typically more relevant on models in which the performance on clean images is least affected.
Table 1: DNR results with VGG16 on Tiny-ImageNet.

Pruning type      | Compression ratio | % Channels present | Clean | FGSM  | PGD
Unpruned baseline | 1x                | 100                | 50.91 | 18.19 | 13.87
Irregular         | --                | 98.52              | 51.71 | 18.21 | 14.46
Channel           | --                | 74                 | 51.09 | 17.92 | 13.54
Here, we compare the performance of DNR with ADMM-based [32] and Lasso-based [26] robust pruning. For ADMM-based robust pruning, we followed a three-stage compression technique, namely pre-training, ADMM-based pruning, and masked retraining, performing pruning for 30 epochs as described in [32]. Lasso-based pruning adds a regularizer to the loss function to penalize the weight magnitudes, where the regularizer coefficient determines the penalty factor. Table 2 shows that our proposed method outperforms both the ADMM- and Lasso-based approaches by a considerable margin, retaining the advantages of both worlds. In particular, compared to ADMM, with the VGG16 (ResNet18) model on CIFAR-10, DNR provides increased classification accuracy on perturbed images at higher compression. Compared to Lasso, we achieve higher compression and increased accuracy on both perturbed and clean images for VGG16 (ResNet18) on CIFAR-10 classification.
Footnote 5: Romanized numbers in the table are results of our experiments, and italicized values are taken directly from the respective original papers.
Observation 4: A naively tuned per-layer pruning ratio degrades both the robustness and the clean-image classification performance of a model.
To show this, we evaluated robust compression using naive ADMM, i.e., naively tuned per-layer pruning ratios (uniform across all but the first layer to reach the total target sparsity). As shown in Table 2, this clearly degrades performance, implying that layer-sparsity tuning is necessary for ADMM to perform well.
Table 2: Comparison with ADMM [32] and Lasso [26] based robust pruning on CIFAR-10.

Model    | Method     | No pretrained model needed | Per-layer sparsity knowledge not needed | Target pruning ratio met | Pruning type | Clean | FGSM  | PGD
VGG16    | ADMM [32]  |    |    | Y  | Irregular | --    | --    | --
VGG16    | ADMM naive |    | Y  | Y  | Irregular | 83.87 | 42.46 | 32.87
VGG16    | Lasso [26] | Y  | Y  |    | Irregular | 83.24 | 50.32 | 42.01
VGG16    | DNR        | Y  | Y  | Y  | Irregular | 86.74 | 52.92 | 43.21
ResNet18 | ADMM [32]  |    |    | Y  | Irregular | --    | --    | --
ResNet18 | ADMM naive |    | Y  | Y  | Irregular | 86.10 | 50.49 | 42.24
ResNet18 | Lasso [26] | Y  | Y  |    | Irregular | 85.92 | 55.20 | 46.80
ResNet18 | DNR        | Y  | Y  | Y  | Irregular | 87.32 | 55.13 | 47.35
To evaluate the merit of DNR as a clean-image-only pruning scheme (DNRC), we trained using DNR with the same loss function minus the adversarial loss term (by setting \beta = 1 in Eq. 6) to reach a target pruning ratio. Table 3 shows that our approach consistently outperforms other state-of-the-art non-iterative pruning approaches based on momentum information [5, 4], reinforcement-learning-driven auto-compression (AMC) [12], and connection sensitivity [20]. The error-difference column reports the difference from the corresponding non-pruned baseline model. We also present performance on CIFAR-100 for VGG16 and ResNet18 and on Tiny-ImageNet for VGG16. In particular, we can achieve high compression on CIFAR-10 with both irregular and channel pruning while maintaining accuracy similar to the baseline. On CIFAR-100, high compression yields no significant drop in top-1 accuracy with either irregular or channel pruning. Moreover, our evaluation shows that practical speedups can be achieved for CIFAR-10 and CIFAR-100 through channel pruning using DNRC. For Tiny-ImageNet, DNRC can provide considerable compression and speedup with negligible accuracy drop.
Footnote 6: To have an "apples-to-apples" comparison, we provide results on the ResNet50 model for classification on CIFAR-10. All other simulations are done on only the ResNet18 variant of ResNet.

Table 3: Comparison of DNRC with SOTA non-iterative pruning schemes on clean-image classification.

Dataset       | Model    | Method              | Pruning type | Top-1 error (%) | Error vs. baseline | Speedup
CIFAR-10      | VGG16    | SNIP [20]           | Irregular    | 8.00  | 0.26  | --
CIFAR-10      | VGG16    | Sparse-learning [4] | Irregular    | 7.00  | 0.5   | --
CIFAR-10      | VGG16    | DNRC                | Irregular    | 6.50  | 0.09  | --
CIFAR-10      | VGG16    | DNRC                | Channel      | --    | 1.5   | --
CIFAR-10      | ResNet50 | GSM [5]             | Irregular    | 6.20  | 0.25  | --
CIFAR-10      | ResNet50 | AMC [12]            | Irregular    | 6.45  | +0.02 | --
CIFAR-10      | ResNet50 | DNRC                | Irregular    | 4.8   | 0.07  | --
CIFAR-10      | ResNet18 | DNRC                | Irregular    | --    | 0.10  | --
CIFAR-10      | ResNet18 | DNRC                | Channel      | --    | 0.27  | --
CIFAR-100     | VGG16    | DNRC                | Irregular    | 27.14 | 1.04  | --
CIFAR-100     | VGG16    | DNRC                | Channel      | 28.78 | 2.68  | --
CIFAR-100     | ResNet18 | DNRC                | Irregular    | 24.9  | 1.17  | --
CIFAR-100     | ResNet18 | DNRC                | Channel      | --    | 1.55  | --
Tiny-ImageNet | VGG16    | DNRC                | Irregular    | 40.96 | +0.36 | --
Tiny-ImageNet | VGG16    | DNRC                | Channel      | 42.61 | 1.28  | --
Table 4: Ablation of the dynamic regularizer (DNR on CIFAR-10): accuracy (%) with irregular and channel pruning.

Model    | Method                      | Irregular: Clean | FGSM  | PGD   | Channel: Clean | FGSM  | PGD
VGG16    | Without dynamic regularizer | 87.01 | 50.09 | 40.62 | 86.28 | 49.49 | 41.25
VGG16    | With dynamic regularizer    | 86.74 | 52.92 | 43.21 | 85.83 | 51.03 | 42.36
ResNet18 | Without dynamic regularizer | 87.45 | 53.52 | 45.33 | 87.97 | 53.10 | 45.91
ResNet18 | With dynamic regularizer    | 87.32 | 55.13 | 47.35 | 87.49 | 56.09 | 48.33
To understand the performance of the proposed hybrid loss function with a dynamic regularizer, we performed an ablation with both VGG16 and ResNet18 on CIFAR-10 at fixed target parameter densities, using irregular and channel pruning, respectively. As shown in Table 4, using the dynamic regularizer improves the adversarial classification accuracy by up to 2.83% for VGG16 and 2.99% for ResNet18, with similar clean-image classification performance.
Fig. 4 presents the performance of the pruned models as a function of the number of PGD attack iterations and the attack bound \epsilon. In particular, we see that, for both irregular and channel-pruned models, the accuracy degrades with a higher number of attack iterations. When \epsilon increases, the accuracy drop is similar in both pruning schemes. These trends suggest that our robustness is not achieved via gradient obfuscation [26].
This paper addresses the open problem of achieving ultra-high compression of DNN models while maintaining their robustness through a non-iterative training approach. In particular, the proposed DNR method leverages a novel sparse-learning strategy with a hybrid loss function that has a dynamic regularizer to achieve better trade-offs between accuracy, model size, and robustness. Furthermore, our extension to support channel pruning shows that compressed models produced by DNR can achieve a practical inference speedup.