A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs

11/03/2020 ∙ by Souvik Kundu, et al.

This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the proposed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet18, on CIFAR-10 and CIFAR-100, as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20× compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives.


1 Introduction

In recent years, deep neural networks (DNNs) have emerged as critical components in various applications, including image classification [17], speech recognition [15], medical image analysis [21], and autonomous driving [3]. However, despite the proliferation of deep-learning-powered applications, machine learning models have raised significant security concerns due to their vulnerability to adversarial examples, i.e., maliciously generated images that are perceptually similar to clean ones yet able to fool classifier models into making wrong predictions [2, 8]. Various recent works have proposed associated defense mechanisms, including adversarial training [8], hiding gradients [31], adding noise to the weights [14], and several others [24].

Figure 1: (a) Weight distribution of a convolution layer of the ResNet18 model for different training schemes: normal, adversarial [23], and noisy adversarial [14]. (b) An adversarially generated image ($\hat{x}$) obtained through an FGSM attack, which is predicted to be the digit 5 instead of 4, the label of the clean image ($x$).

Meanwhile, large models have high inference latency, computation, and storage costs that pose significant challenges for deployment on IoT devices. Thus reduced-size models [19, 7] and model compression techniques, e.g., pruning [4, 5, 12], have gained significant traction. In particular, earlier work showed that pruning can remove a large fraction of the model parameters without a significant accuracy drop [4, 5], and that ensuring the pruned models have structure can yield performance improvements on a broad range of compute platforms [13]. However, adversarial training that increases network robustness generally demands more non-zero parameters than needed for clean data alone [23], as illustrated in Fig. 1(a). Thus a naively compressed model that performs well on clean images can become vulnerable to adversarial images. Unfortunately, despite a plethora of work on compressed model performance on clean data, there have been only a few studies on the robustness of compressed models under adversarial attacks.

In particular, some prior works [32, 9] have tried to design compressed yet robust models through a unified constrained optimization formulation using the alternating direction method of multipliers (ADMM), in which dynamic regularization is the key to outperforming state-of-the-art pruning techniques [27]. However, in ADMM the network designer needs to specify layer-wise sparsity ratios, which requires prior knowledge of an effective compressed model. This knowledge may not be available, and thus training may require multiple iterations to determine good layer-sparsity ratios. Another related work [29] uses pre-trained weights to perform robust pruning and demonstrates the benefits of fine-tuning after training in terms of increased performance. In other schemes like Lasso [26], a target compression ratio cannot be set because the final compression ratio is not determined until training completes. Moreover, Lasso requires separate re-training to recover accuracy after non-significant weights are set to zero, resulting in costly training.

In contrast, this paper presents dynamic network rewiring (DNR), a unified training framework that finds a compressed model with increased robustness and does not require individual per-layer target sparsity ratios. In particular, we introduce a hybrid loss function for robust compression that has three major components: a clean-image classification loss, a dynamic $L_2$-regularizer term inspired by a relaxed version of ADMM [6], and an adversarial training loss. Inspired by the sparse-learning-based training scheme of [4], we then propose a single-shot training framework that achieves a robust pruned DNN using the proposed loss. In particular, DNR dynamically arranges per-layer pruning ratios using normalized momentum, maintaining the target pruning ratio every epoch without requiring any fine-tuning. In summary, our key contributions are:

  • Given only a global pruning ratio, we propose a single-shot (non-iterative) training framework that simultaneously achieves an ultra-high compression ratio, state-of-the-art accuracy on clean data, and robustness to perturbed images.

  • We extend the approach to support a structured pruning technique, namely channel pruning, enabling benefits on a broader class of compute platforms. As opposed to conventional sparse learning [4], which can perform only irregular pruning, models generated through structured DNR can significantly speed up inference. To the best of our knowledge, we are the first to propose a non-iterative robust training framework that supports both irregular and channel pruning.

  • We provide a comprehensive investigation of adversarial robustness for both channel and irregular pruning, and obtain insightful observations through an extensive set of experiments on CIFAR-10 [16], CIFAR-100 [16], and Tiny-ImageNet [10] using variants of ResNet18 [11] and VGG16 [30]. Our proposed method consistently outperforms state-of-the-art (SOTA) approaches [26, 32] with negligible accuracy drop compared to the unpruned baselines.

We further empirically demonstrate the superiority of our scheme when used to target model compression for clean-image-only classification, compared to SOTA non-iterative pruning mechanisms [4, 5, 12, 20]. (This paper targets low-cost training; comparisons to iterative pruning methods (e.g., [27]) are therefore out of scope.)

The remainder of this paper is structured as follows. In Section 2 we present the necessary background. Section 3 describes the proposed DNR-based training method. We present our experimental results in Section 4 and conclude in Section 5.

2 Background Work

2.1 Adversarial Attacks

Recently, various adversarial attacks have been proposed to find fake images, i.e., adversarial examples, which have barely-visible perturbations from real images but still manage to fool a trained DNN. One of the most common attacks is the fast gradient sign method (FGSM) [8]. Given a vectorized input $x$ of the real image and corresponding label $t$, FGSM perturbs each element of $x$ along the sign of the associated element of the gradient of the inference loss w.r.t. $x$, as shown in Eq. 1 and illustrated in Fig. 1(b). Another common attack is projected gradient descent (PGD) [23]. The PGD attack is a multi-step variant of FGSM where $\hat{x}^{0} = x$ and the iterative update of the perturbed data in step $k+1$ is given in Eq. 2:

$\hat{x} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_{x}\,\mathcal{L}(f(x;\theta), t)\big)$  (1)

$\hat{x}^{k+1} = \mathrm{Proj}_{P_{\epsilon}(x)}\Big(\hat{x}^{k} + \sigma \cdot \mathrm{sign}\big(\nabla_{x}\,\mathcal{L}(f(\hat{x}^{k};\theta), t)\big)\Big)$  (2)

Here, the scalar $\epsilon$ corresponds to the perturbation constraint that determines the severity of the perturbation, and $f(\cdot\,;\theta)$ generates the output of the DNN, parameterized by $\theta$. Proj projects the updated adversarial sample onto the projection space $P_{\epsilon}(x)$, the $\epsilon$-$L_{\infty}$ neighbourhood of the benign sample $x$, and $\sigma$ is the attack step size. (It is noteworthy that the generated $\hat{x}$ are clipped to a valid range, which for our experiments is $[0, 1]$.)

Note that these two strategies assume the attacker knows the details of the DNN and are thus termed white-box attacks. We will evaluate the merit of our training scheme by measuring the robustness of our trained models to fake images generated by these attacks. We argue that this evaluation is more comprehensive than using images generated by attacks that assume limited knowledge of the DNN [28]. Moreover, we note that PGD is one of the strongest adversarial example generation algorithms [23] and use it as part of our proposed framework.
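To make the two attacks concrete, the following is a minimal PyTorch sketch of Eqs. 1 and 2. The generic `model`, the cross-entropy loss, and the helper names are our own assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, t, eps):
    """One-step FGSM (Eq. 1): perturb x along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), t)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x + eps * grad.sign()).clamp(0.0, 1.0)

def pgd_attack(model, x, t, eps, sigma, steps):
    """Multi-step PGD (Eq. 2): iterate FGSM-style steps, projecting back onto
    the eps-L_inf neighbourhood of the benign sample x after every step."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), t)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + sigma * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # Proj onto P_eps(x)
        x_adv = x_adv.clamp(0.0, 1.0)             # keep pixels in a valid range
    return x_adv.detach()
```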

2.2 Model Compression

ADMM is a powerful optimization method used to solve problems with non-convex, combinatorial constraints [1]. It decomposes the original optimization problem into two sub-problems and solves them iteratively until convergence. Pruning convolutional neural networks (CNNs) can be modeled as an optimization problem in which the cardinality of each layer's weight tensor is bounded by its pre-specified pruning ratio. In the ADMM framework, such constraints are transformed into ones represented with indicator functions, such as $g_l(Z_l) = 0$ for $\lVert Z_l \rVert_0 \le n_l$ and $g_l(Z_l) = +\infty$ otherwise. Here, $Z_l$ denotes the duplicate variable [1] and $n_l$ represents the target number of non-zero weights determined by the pre-specified pruning ratios. Next, the original optimization problem is reformulated as:

$\min_{\theta, Z}\; \mathcal{L}\big(f(x;\theta), t\big) + \sum_{l=1}^{L} g_l(Z_l) + \sum_{l=1}^{L} \frac{\rho}{2}\,\lVert \theta_l - Z_l + U_l \rVert_F^2$  (3)

where $U_l$ is the Lagrangian multiplier and $\rho$ is the penalization factor applied when the parameters $\theta_l$ and $Z_l$ differ. Eq. (3) is broken into two sub-problems that solve for $\theta$ and $Z$ iteratively until convergence [27]. The first sub-problem uses stochastic gradient descent (SGD) to update $\theta$, while the second sub-problem applies projection to find the assignment of $Z$ that is closest to $\theta$ yet satisfies the cardinality constraint, effectively pruning weights with small magnitudes.
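For illustration, the projection sub-problem has a simple closed form: keep the $n_l$ largest-magnitude entries of the weight tensor and zero the rest. A minimal PyTorch sketch (the helper name is ours):

```python
import torch

def project_cardinality(theta: torch.Tensor, n_keep: int) -> torch.Tensor:
    """Euclidean projection of theta onto {Z : ||Z||_0 <= n_keep}:
    retain the n_keep largest-magnitude entries, zero everything else."""
    flat = theta.flatten()
    z = torch.zeros_like(flat)
    idx = flat.abs().topk(n_keep).indices
    z[idx] = flat[idx]
    return z.view_as(theta)
```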

Not only can ADMM prune a model's weight tensors, but it also acts as a dynamic regularizer. Such adaptive regularization is one of the main reasons behind the success of its use in pruning. However, ADMM-based pruning has several drawbacks. First, ADMM requires prior knowledge of the per-layer pruning ratios. Second, ADMM does not guarantee that the pruning ratio will be met, and therefore an additional round of hard pruning is required after ADMM completes. Third, not all problems solved with ADMM are guaranteed to converge. Fourth, to improve convergence, $\rho$ needs to be progressively increased across several rounds of training, which increases training time [1].

Sparse learning [4] addresses the shortcomings of ADMM by leveraging exponentially smoothed gradients (momentum) to prune weights. It redistributes pruned weights across layers according to their mean momentum contribution. The weights that will be removed and transferred to other layers are chosen according to their magnitudes while the weights that are brought back (reactivated) are selected based on their momentum values. On the other hand, a major shortcoming of sparse learning compared to ADMM is that it does not benefit from a dynamic regularizer and thus often yields lower levels of accuracy. Furthermore, existing sparse-learning schemes only support irregular forms of pruning, limiting speed-up on many compute platforms. Finally, sparse-learning, to the best of our knowledge, has not previously been extended to robust model compression.
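As a rough single-layer sketch of this prune-and-regrow step (ignoring the cross-layer momentum-based redistribution that sparse learning [4] actually performs; names and signatures are our own), one epoch-end rewiring might look like:

```python
import torch

def rewire_layer(weight, momentum, mask, num_rewire):
    """Drop the num_rewire smallest-magnitude active weights, then regrow
    num_rewire previously-inactive positions with the largest momentum."""
    inactive = mask.view(-1) == 0
    # Prune: smallest-magnitude weights among the currently active positions.
    w = weight.abs().view(-1).masked_fill(inactive, float('inf'))
    drop = w.topk(num_rewire, largest=False).indices
    mask.view(-1)[drop] = 0.0
    # Regrow: previously-inactive positions with the largest momentum magnitude.
    m = momentum.abs().view(-1).masked_fill(~inactive, float('-inf'))
    grow = m.topk(num_rewire).indices
    mask.view(-1)[grow] = 1.0
    return mask
```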

3 Dynamic Network Rewiring

To tackle the shortcomings of ADMM and sparse learning, this section introduces a dynamic regularizer that enables non-iterative training to achieve high accuracy with compressed models. We then describe a hybrid loss function that provides robustness to the compressed models, and an extension to support structured pruning.

3.1 Dynamic Regularizer

For a DNN parameterized by $\theta$ with $L$ layers, we let $\theta_l$ represent the weight tensor of layer $l$. In our sparse-learning approach, these weight tensors are element-wise multiplied ($\odot$) by corresponding binary mask tensors ($M_l$) to retain only a fraction of non-zero weights, thereby meeting a target pruning ratio. We update each layer's mask every epoch, similar to [4]: the number of non-zeros is updated based on the layer's normalized momentum, and the specific non-zero entries are set to favor large-magnitude weights. We incorporate an ADMM-style dynamic regularizer [27] into this framework by introducing a duplicate variable $Z_l$ for the non-zero weights, which is in turn updated at the start of every epoch. Unlike [27], we only penalize differences between the masked weights ($\theta_l \odot M_l$) of a layer and their corresponding duplicate variable $Z_l$. Because the total cardinality constraint of the masked parameters is satisfied by construction of the masks, the indicator penalty is redundant and the loss function may be simplified as

$\min_{\theta}\; \mathcal{L}\big(f(x;\theta \odot M), t\big) + \sum_{l=1}^{L} \frac{\gamma}{2}\,\lVert \theta_l \odot M_l - Z_l \rVert_F^2$  (4)

where $\gamma$ is the dynamic penalizing factor. This simplification is particularly important because the indicator function used in Eq. 3 is non-differentiable, and its removal in Eq. 4 enables the loss function to be minimized without decomposition into two sub-problems. (Note that this simplified loss function also drops the Lagrangian term because $Z_l$ is updated with $\theta_l \odot M_l$ at the beginning of each epoch, forcing the Lagrangian multiplier and its contribution to the loss function to always be 0.) Moreover, SGD with this loss function converges similarly to SGD with the original loss and more reliably than ADMM. Intuitively, the key role of the dynamic regularizer in this simplified loss function is to discourage the DNN from changing the values of large-magnitude weights unless the corresponding loss is large, similar to what the dynamic regularizer does in ADMM-based pruning.
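As a minimal sketch of Eq. 4 (helper names are ours), the dynamic regularizer simply pulls each layer's masked weights toward the duplicates $Z_l$ that were frozen at the start of the epoch:

```python
import torch

def dnr_loss(task_loss, masked_weights, duplicates, gamma):
    """Eq. 4: task loss plus (gamma/2) * sum_l ||theta_l (.) M_l - Z_l||_F^2."""
    reg = sum(((w - z) ** 2).sum() for w, z in zip(masked_weights, duplicates))
    return task_loss + 0.5 * gamma * reg

# At the start of every epoch the duplicates are refreshed from the masked
# weights; this is what keeps the Lagrangian term identically zero:
#   duplicates = [(w * m).detach().clone() for w, m in zip(weights, masks)]
```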

3.2 Proposed Hybrid Loss Function

For a given input image $x$, adversarial training can be viewed as a min-max optimization problem that finds the model parameters $\theta$ that minimize the loss associated with the corresponding adversarial sample $\hat{x}$, as shown below:

$\min_{\theta}\; \max_{\hat{x} \in P_{\epsilon}(x)} \mathcal{L}\big(f(\hat{x};\theta), t\big)$  (5)

In our framework, we use SGD for loss minimization and PGD to generate adversarial images. More specifically, to boost classification robustness on perturbed data, we propose a hybrid loss function that combines the simplified loss function of Eq. 4 with the adversarial image loss, i.e.,

$\min_{\theta}\; \lambda\Big[\mathcal{L}\big(f(x;\theta \odot M), t\big) + \sum_{l=1}^{L} \frac{\gamma}{2}\,\lVert \theta_l \odot M_l - Z_l \rVert_F^2\Big] + (1-\lambda)\,\mathcal{L}\big(f(\hat{x};\theta \odot M), t\big)$  (6)

Here, $\lambda$ provides a tunable trade-off between the two loss components.

Observation 1 A DNN with only a fraction of its weights active throughout training can be trained with the proposed hybrid loss to converge similarly to the un-pruned model (mask $M = \mathbf{1}$), yielding a robust yet compressed model.

This is exemplified in Fig. 2(a), which shows similar convergence trends for both pruned and unpruned models, simultaneously achieving the target compression and robustness while avoiding the need for multiple training iterations.
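Putting Eqs. 4–6 together, one training step might look like the following sketch, which assumes the `pgd_attack` and `dnr_loss` helpers above and that the binary masks have already been folded into the weight tensors:

```python
import torch.nn.functional as F

def hybrid_training_step(model, x, t, duplicates, gamma, lam, optimizer,
                         eps, sigma, steps):
    """One SGD step on the hybrid loss of Eq. 6: a lambda-weighted mix of the
    clean-image loss (with the dynamic regularizer) and the PGD adversarial loss."""
    x_adv = pgd_attack(model, x, t, eps, sigma, steps)       # inner maximization
    masked = [p for p in model.parameters() if p.dim() > 1]  # assumed: pruned weight tensors
    clean = F.cross_entropy(model(x), t)
    adv = F.cross_entropy(model(x_adv), t)
    loss = lam * dnr_loss(clean, masked, duplicates, gamma) + (1.0 - lam) * adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```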

3.3 Support for Channel Pruning

Let the weight tensor of a convolutional layer $l$ be denoted as $\theta_l \in \mathbb{R}^{m \times n \times k_h \times k_w}$, where $k_h$ and $k_w$ are the height and width of the convolutional kernel, and $m$ and $n$ represent the number of filters and channels per filter, respectively. We convert this tensor to a 2D weight matrix with $m$ rows and $n \cdot k_h \cdot k_w$ columns, and then partition this matrix into $n$ sub-matrices, one for each channel. To compute the importance of a channel $c$, we find the Frobenius norm (F-norm) of the corresponding sub-matrix, effectively computing $F_c = \lVert \theta_l^{(c)} \rVert_F$. Based on the fraction of non-zero weights that need to be rewired during an epoch, denoted by the pruning rate $p_r$, we compute the number of channels that must be pruned from each layer $l$ and prune the channels with the lowest F-norms. We then compute each layer's importance based on the normalized momentum contributed by its non-zero channels. These importance measures are used to determine the number of zero-F-norm channels that should be re-grown in each layer $l$; more precisely, we re-grow the zero-F-norm channels with the highest Frobenius norms of their momentum. We note that this approach can easily be extended to enable various other forms of structured pruning. Moreover, although the framework supports pruning of both convolutional and linear layers, this paper focuses on reducing the computational complexity of a DNN; we therefore experiment with pruning only convolutional layers, because they dominate the computational complexity [18]. The detailed pseudo-code of the proposed training framework is shown in Algorithm 1, and a sketch of the channel-importance computation follows below.
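A minimal sketch of the channel-importance computation and the pruning half of the rewiring step (names are ours; the momentum-driven re-growth follows the same pattern applied to the momentum tensor):

```python
import torch

def channel_fnorms(weight):
    """F-norm of each channel's sub-matrix for a conv weight of shape
    (m filters, n channels, kh, kw): one m x (kh*kw) block per channel."""
    n = weight.shape[1]
    return weight.permute(1, 0, 2, 3).reshape(n, -1).norm(dim=1)

def prune_lowest_channels(weight, mask, num_prune):
    """Zero the mask (same shape as weight) of the num_prune channels
    with the smallest F-norms."""
    drop = torch.argsort(channel_fnorms(weight * mask))[:num_prune]
    mask[:, drop] = 0.0
    return mask
```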

Figure 2: (a) Training loss vs. epochs and (b) Pruning sensitivity per layer for VGG16 on CIFAR-10.
Data: weights $\theta$, momentum $\mu$, binary masks $M$
Data: density $d$, penalizing factor $\gamma$, pruning rate $p_r$; pT: irregular or channel
1  for $l \leftarrow 0$ to $L$ do
2      initialize $M_l$ with density $d$  &  $\theta_l \leftarrow \theta_l \odot M_l$
3  end for
4  for $e \leftarrow 0$ to numEpochs do
5      $Z_l \leftarrow \theta_l \odot M_l$ for every layer $l$
6      for $b \leftarrow 0$ to numBatches do
7          generate adversarial mini-batch $\hat{x}$ via PGD
8          update $\theta$ with SGD on the hybrid loss of Eq. 6, masking gradients with $M$
9      end for
10     for $l \leftarrow 0$ to $L$ do
11         prune a $p_r$ fraction of active weights in layer $l$ (pT: by weight magnitude, or by channel F-norm)
12     end for
13     re-grow pruned positions (or channels) across layers according to normalized momentum; decay $p_r$
14 end for
Algorithm 1 DNR Training.

It is noteworthy that DNR's ability to arrange per-layer pruning ratios for robust compression avoids the tedious task of hand-tuning pruning ratios based on layer sensitivity. To illustrate this, we follow [5] and quantify the sensitivity of a layer by measuring the percentage reduction in classification accuracy, on both clean and adversarial images, caused by pruning that layer to a fixed ratio while leaving all other layers unpruned.

Observation 2 DNN layers' sensitivities to clean and perturbed images are not necessarily equal; thus, determining per-layer pruning ratios for robust models is particularly challenging.

As exemplified in Fig. 2(b), for a given pruning ratio there is a significant difference between the layers' sensitivities for clean and for perturbed image classification. DNR, on the contrary, automatically finds per-layer pruning ratios (overlaid as pruning sensitivity as in [5]) that serve both types of image classification well while meeting a global pruning target.
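For reference, this per-layer sensitivity measurement can be sketched as follows, assuming a generic `evaluate` function (names are ours): prune a single layer by magnitude to the given density, record the accuracy, and restore the weights.

```python
import torch

def layer_sensitivity(model, layer_weight, loader, density, evaluate):
    """Accuracy after pruning only this layer to the given density,
    leaving all other layers untouched; weights are restored afterwards."""
    saved = layer_weight.data.clone()
    num_prune = int(layer_weight.numel() * (1.0 - density))
    if num_prune > 0:
        flat = layer_weight.data.abs().flatten()
        thresh = flat.kthvalue(num_prune).values
        layer_weight.data[layer_weight.data.abs() <= thresh] = 0.0
    acc = evaluate(model, loader)   # clean or adversarial accuracy
    layer_weight.data.copy_(saved)  # undo the temporary pruning
    return acc
```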

4 Experiments

Here, we first describe the experimental setup used to evaluate the effectiveness of the proposed robust training scheme. We then compare our method against other state-of-the-art robust pruning techniques based on ADMM [32] and Lasso [26]. We also evaluate the merit of DNR as a clean-image pruning scheme and show that it consistently outperforms contemporary non-iterative model pruning techniques [20, 4, 5, 12]. We finally present an ablation study to empirically evaluate the importance of the dynamic regularizer in DNR's loss function. We implemented the models in PyTorch [25] and trained/tested on AWS p3.2xlarge instances with an NVIDIA Tesla V100 GPU.

4.1 Experimental Setup

4.1.1 Models and Datasets

We selected three widely used datasets, CIFAR-10 [16], CIFAR-100 [16], and Tiny-ImageNet [10], and picked two well-known CNN models, VGG16 [30] and ResNet18 [11]. Both CIFAR-10 and CIFAR-100 have 50K training samples and 10K test samples with an input image size of 32×32. The training and test sets of Tiny-ImageNet contain 100K and 10K samples, respectively, with an image size of 64×64. For all the datasets we used standard data augmentations (horizontal flip and random crop with reflective padding) and trained the models with a batch size of 128.

4.1.2 Adversarial Attack and DNR Training Settings

For PGD, we set the perturbation constraint $\epsilon$, the attack step size $\sigma$, and the number of attack iterations to the same values as in [14]. For FGSM, we chose the same $\epsilon$ value.

We performed DNR-based training for 200/170/60 epochs for CIFAR-10/CIFAR-100/Tiny-ImageNet with SGD, using a fixed initial learning rate, momentum, and weight decay. For CIFAR-10 and CIFAR-100 the learning rate (LR) was reduced by a fixed factor at three scheduled epochs; for Tiny-ImageNet we reduced the LR at two scheduled epochs. In addition, we hand-tuned the penalizing factor $\gamma$, set an initial pruning rate $p_r$, and linearly decreased the pruning rate every epoch. To balance the clean and adversarial losses, we tuned $\lambda$. Lastly, note that we performed warm-up sparse learning [4] for the first 5 epochs with only the clean-image loss function before using the hybrid loss function with dynamic regularization (see Eq. 6) for robust compression in the remaining epochs.

4.2 Results

Figure 3: Model compression vs. accuracy (on both clean and adversarially generated images) for irregular and channel pruning evaluated with (a) VGG16 on CIFAR-10 and (b) ResNet18 on CIFAR-100. (c) Comparison of channel pruning with irregular pruning in terms of % of channels present. Note that the % of channels present correlates with inference time [22, 4].

Results on CIFAR datasets: We analyzed the impact of our robust training framework on both clean and adversarially generated images at various target compression ratios, where model compression is computed as the ratio of the total number of weights in the model to the number of non-zero weights in the pruned model. As shown in Figs. 3(a-b), DNR can effectively find a robust model with high compression and negligible compromise in accuracy. In particular, for irregular pruning our method can compress models by more than 20× with a negligible drop in accuracy on clean as well as PGD- and FGSM-perturbed images, compared to the baseline non-pruned models, tested with VGG16 on CIFAR-10 and ResNet18 on CIFAR-100. (A similar trend is observed for VGG16 on CIFAR-100 and ResNet18 on CIFAR-10; these results are not included due to space limitations.)

Observation 3 As the target compression ratio increases, channel pruning degrades adversarial robustness more significantly than irregular pruning.

As we can see in Figs. 3(a-b), the model compression achievable with negligible accuracy loss is lower for structured (channel) pruned models than for irregularly pruned ones. This trend matches the models' performance on clean images. However, as shown in Fig. 3(c), the percentage of channels present in our channel-pruned models can be substantially lower than in their irregular counterparts, implying a similarly large speedup in inference time on a wide range of compute platforms [4].

Results on Tiny-ImageNet: As shown in Table 1, DNR can compress the model without any compromise in performance for both clean and perturbed image classification.

It is also noteworthy that all of our accuracy results for both clean and adversarial images correspond to the models that achieve the best test accuracy on clean images, since robustness gains are typically most relevant for models whose performance on clean images is least affected.

Pruning type      | Compression ratio | % Channels present | Clean (%) | FGSM (%) | PGD (%)
Unpruned baseline | 1×                | 100                | 50.91     | 18.19    | 13.87
Irregular         | —                 | 98.52              | 51.71     | 18.21    | 14.46
Channel           | —                 | 74                 | 51.09     | 17.92    | 13.54
Table 1: Results on VGG16 to classify Tiny-ImageNet.

4.3 Comparison with State-of-the-art

Here, we compare the performance of DNR with ADMM-based [32] and Lasso-based [26] robust pruning. For ADMM-based robust pruning we followed a three-stage compression technique, namely pre-training, ADMM-based pruning, and masked retraining, performing pruning for 30 epochs with the settings described in [32]. Lasso-based pruning adds a regularizer to its loss function to penalize the weight magnitudes, where the regularizer coefficient determines the penalty factor. Table 2 shows that our proposed method outperforms both the ADMM-based and Lasso-based approaches by a considerable margin, retaining the advantages of both worlds. In particular, compared to ADMM with the VGG16 (ResNet18) model on CIFAR-10, DNR provides increased classification accuracy on perturbed images at a higher compression ratio. Compared to Lasso, we achieve higher compression along with increased accuracy on both perturbed and clean images for VGG16 (ResNet18) on CIFAR-10 classification.

Observation 4 A naively tuned per-layer pruning ratio degrades both the robustness and the clean-image classification performance of a model.

To see this, we evaluated robust compression using naive ADMM, i.e., with a naively tuned per-layer pruning ratio (pruning all layers except the first uniformly to reach the total target sparsity). As shown in Table 2, this clearly degrades performance, implying that layer-sparsity tuning is necessary for ADMM to perform well.

Model    | Method     | Pruning type | Clean (%) | FGSM (%) | PGD (%)
VGG16    | ADMM [32]  | Irregular    | —         | —        | —
VGG16    | ADMM naive | Irregular    | 83.87     | 42.46    | 32.87
VGG16    | Lasso [26] | Irregular    | 83.24     | 50.32    | 42.01
VGG16    | DNR        | Irregular    | 86.74     | 52.92    | 43.21
ResNet18 | ADMM [32]  | Irregular    | —         | —        | —
ResNet18 | ADMM naive | Irregular    | 86.10     | 50.49    | 42.24
ResNet18 | Lasso [26] | Irregular    | 85.92     | 55.20    | 46.80
ResNet18 | DNR        | Irregular    | 87.32     | 55.13    | 47.35
Table 2: Comparison of DNR, ADMM-based, and Lasso-based robust pruning schemes on CIFAR-10. Unlike ADMM, DNR needs neither a pre-trained model nor per-layer sparsity knowledge; unlike Lasso, DNR meets the target pruning ratio.

4.4 Pruning to Classify Clean-only Images

To evaluate the merit of DNR as a clean-image-only pruning scheme (DNR-C), we trained using DNR with the same loss function minus the adversarial loss term (by setting $\lambda = 1$ in Eq. 6) to reach a target pruning ratio. Table 3 shows that our approach consistently outperforms other state-of-the-art non-iterative pruning approaches based on momentum information [5, 4], reinforcement-learning-driven auto-compression (AMC) [12], and connection sensitivity [20]. The error-deviation column reports the difference in top-1 error from the corresponding non-pruned baseline model. We also present performance on CIFAR-100 for VGG16 and ResNet18 and on Tiny-ImageNet for VGG16. (For an "apples to apples" comparison we provide results on the ResNet50 model for classification on CIFAR-10; all other experiments use the ResNet18 variant of ResNet.) In particular, we achieve high compression on CIFAR-10 with both irregular and channel pruning while maintaining accuracy similar to the baseline. On CIFAR-100, the compression achieved yields no significant drop in top-1 accuracy with either irregular or channel pruning. Moreover, our evaluation shows that a substantial practical speedup can be achieved through channel pruning using DNR-C on both CIFAR-10 and CIFAR-100. For Tiny-ImageNet, DNR-C likewise provides compression and speedup with negligible accuracy drop.

Dataset       | Model    | Method              | Pruning type | Compression ratio | Top-1 error (%) | Δ from baseline | Speedup
CIFAR-10      | VGG16    | SNIP [20]           | Irregular    | —                 | 8.00            | -0.26           | —
CIFAR-10      | VGG16    | Sparse-learning [4] | Irregular    | —                 | 7.00            | -0.5            | —
CIFAR-10      | VGG16    | DNR-C               | Irregular    | —                 | 6.50            | -0.09           | —
CIFAR-10      | VGG16    | DNR-C               | Channel      | —                 | —               | -1.5            | —
CIFAR-10      | ResNet50 | GSM [5]             | Irregular    | —                 | 6.20            | -0.25           | —
CIFAR-10      | ResNet50 | AMC [12]            | Irregular    | —                 | 6.45            | +0.02           | —
CIFAR-10      | ResNet50 | DNR-C               | Irregular    | —                 | 4.8             | -0.07           | —
CIFAR-10      | ResNet18 | DNR-C               | Irregular    | —                 | —               | -0.10           | —
CIFAR-10      | ResNet18 | DNR-C               | Channel      | —                 | —               | -0.27           | —
CIFAR-100     | VGG16    | DNR-C               | Irregular    | —                 | 27.14           | -1.04           | —
CIFAR-100     | VGG16    | DNR-C               | Channel      | —                 | 28.78           | -2.68           | —
CIFAR-100     | ResNet18 | DNR-C               | Irregular    | —                 | 24.9            | -1.17           | —
CIFAR-100     | ResNet18 | DNR-C               | Channel      | —                 | —               | -1.55           | —
Tiny-ImageNet | VGG16    | DNR-C               | Irregular    | —                 | 40.96           | +0.36           | —
Tiny-ImageNet | VGG16    | DNR-C               | Channel      | —                 | 42.61           | -1.28           | —
Table 3: Comparison with state-of-the-art non-iterative pruning schemes on CIFAR-10, and comparison of deviation from baseline on CIFAR-100 and Tiny-ImageNet.
         |                     | Accuracy (%), irregular pruning | Accuracy (%), channel pruning
Model    | Method (DNR)        | Clean | FGSM  | PGD             | Clean | FGSM  | PGD
VGG16    | Without dynamic reg.| 87.01 | 50.09 | 40.62           | 86.28 | 49.49 | 41.25
VGG16    | With dynamic reg.   | 86.74 | 52.92 | 43.21           | 85.83 | 51.03 | 42.36
ResNet18 | Without dynamic reg.| 87.45 | 53.52 | 45.33           | 87.97 | 53.10 | 45.91
ResNet18 | With dynamic reg.   | 87.32 | 55.13 | 47.35           | 87.49 | 56.09 | 48.33
Table 4: Comparison of DNR with and without the dynamic regularizer for CIFAR-10 classification.

4.5 Ablation Study

To understand the performance of the proposed hybrid loss function with a dynamic regularizer, we performed an ablation with both VGG16 and ResNet18 on CIFAR-10 at a fixed target parameter density, using irregular and channel pruning. As shown in Table 4, using the dynamic regularizer improves adversarial classification accuracy by up to 2.83% for VGG16 and 2.99% for ResNet18 while providing similar clean-image classification performance.

4.6 Generalized Robustness Against PGD Attack of Different Strengths

Fig. 4 presents the performance of the pruned models as a function of the number of PGD attack iterations and the attack bound $\epsilon$. In particular, for both irregular and channel pruned models, accuracy degrades as the number of attack iterations grows; when $\epsilon$ increases, the accuracy drop is similar for both pruning schemes. These trends suggest that our robustness is not achieved via gradient obfuscation [26].

Figure 4: On CIFAR-10, the perturbed-data accuracy of ResNet18 under PGD attack versus increasing (a), (c) attack iterations and (b), (d) attack bound $\epsilon$, for irregular and channel pruned models, respectively.

5 Conclusions

This paper addresses the open problem of achieving ultra-high compression of DNN models while maintaining their robustness through a non-iterative training approach. In particular, the proposed DNR method leverages a novel sparse-learning strategy with a hybrid loss function whose dynamic regularizer achieves better trade-offs between accuracy, model size, and robustness. Furthermore, our extension to support channel pruning shows that compressed models produced by DNR can provide a substantial practical inference speed-up.

References

  • [1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), pp. 1–122. Cited by: §2.2, §2.2.
  • [2] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. Cited by: §1.
  • [3] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) Deepdriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730. Cited by: §1.
  • [4] T. Dettmers and L. Zettlemoyer (2019) Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840. Cited by: 2nd item, §1, §1, §1, §2.2, §3.1, Figure 3, §4.1.2, §4.2, §4.4, Table 3, §4.
  • [5] X. Ding, X. Zhou, Y. Guo, J. Han, J. Liu, et al. (2019) Global sparse momentum SGD for pruning very deep neural networks. In Advances in Neural Information Processing Systems, pp. 6379–6391. Cited by: §1, §1, §3.3, §3.3, §4.4, Table 3, §4.
  • [6] T. Dinh and J. Xin (2018) Convergence of a relaxed variable splitting method for learning sparse neural networks via $\ell_1$, $\ell_0$, and transformed-$\ell_1$ penalties. arXiv preprint arXiv:1812.05719. Cited by: §1.
  • [7] A. Fayyazi, S. Kundu, S. Nazarian, P. A. Beerel, and M. Pedram (2019) CSrram: area-efficient low-power ex-situ training framework for memristive neuromorphic circuits based on clustered sparsity. In 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 465–470. Cited by: §1.
  • [8] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §2.1.
  • [9] S. Gui, H. N. Wang, H. Yang, C. Yu, Z. Wang, and J. Liu (2019) Model compression with adversarial robustness: a unified optimization framework. In Advances in Neural Information Processing Systems, pp. 1283–1294. Cited by: §1.
  • [10] L. Hansen (2015) Tiny ImageNet challenge submission. CS 231N. Cited by: 3rd item, §4.1.1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: 3rd item, §4.1.1.
  • [12] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018-09) AMC: autoML for model compression and acceleration on mobile devices. In The European Conference on Computer Vision (ECCV), Cited by: §1, §1, §4.4, Table 3, §4.
  • [13] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §1.
  • [14] Z. He, A. S. Rakin, and D. Fan (2019) Parametric noise injection: trainable randomness to improve deep neural network robustness against adversarial attack. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–597. Cited by: Figure 1, §1, §4.1.2.
  • [15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal processing magazine 29 (6), pp. 82–97. Cited by: §1.
  • [16] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: 3rd item, §4.1.1.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [18] S. Kundu, M. Nazemi, M. Pedram, K. M. Chugg, and P. A. Beerel (2020) Pre-defined sparsity for low-complexity convolutional neural networks. IEEE Transactions on Computers. Cited by: §3.3.
  • [19] S. Kundu, S. Prakash, H. Akrami, P. A. Beerel, and K. M. Chugg (2019) PSConv: a pre-defined sparse kernel based convolution for deep CNNs. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 100–107. Cited by: §1.
  • [20] N. Lee, T. Ajanthan, and P. H. Torr (2018) SNIP: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340. Cited by: §1, §4.4, Table 3, §4.
  • [21] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88. Cited by: §1.
  • [22] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: Figure 3.
  • [23] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: Figure 1, §1, §2.1, §2.1.
  • [24] D. Meng and H. Chen (2017) Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135–147. Cited by: §1.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.
  • [26] A. S. Rakin, Z. He, L. Yang, Y. Wang, L. Wang, and D. Fan (2019) Robust sparse regularization: simultaneously optimizing neural network robustness and compactness. arXiv preprint arXiv:1905.13074. Cited by: 3rd item, §1, §4.3, §4.6, Table 2, §4.
  • [27] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang (2019) ADMM-NN: an algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 925–938. Cited by: §1, §2.2, §3.1, footnote 1.
  • [28] K. Ren, T. Zheng, Z. Qin, and X. Liu (2020) Adversarial attacks and defenses in deep learning. Engineering. Cited by: §2.1.
  • [29] V. Sehwag, S. Wang, P. Mittal, and S. Jana (2020) HYDRA: pruning adversarially robust neural networks. arXiv preprint arXiv:2002.10509. Cited by: §1.
  • [30] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: 3rd item, §4.1.1.
  • [31] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §1.
  • [32] S. Ye, K. Xu, S. Liu, H. Cheng, J. Lambrechts, H. Zhang, A. Zhou, K. Ma, Y. Wang, and X. Lin (2019) Adversarial robustness vs. model compression, or both. In The IEEE International Conference on Computer Vision (ICCV), Vol. 2. Cited by: 3rd item, §1, §4.3, Table 2, §4.