
SparseVLR: A Novel Framework for Verified Locally Robust Sparse Neural Networks Search

The compute-intensive nature of neural networks (NNs) limits their deployment in resource-constrained environments such as cell phones, drones, autonomous robots, etc. Hence, developing robust sparse models fit for safety-critical applications has been an issue of longstanding interest. Though adversarial training has been combined with model sparsification to attain this goal, conventional adversarial training approaches provide no formal guarantee that the models would be robust against every rogue sample in a restricted space around a benign sample. Recently proposed verified local robustness techniques provide such a guarantee. This is the first paper that combines the ideas from verified local robustness and dynamic sparse training to develop `SparseVLR' -- a novel framework to search for verified locally robust sparse networks. The obtained sparse models exhibit accuracy and robustness comparable to their dense counterparts at sparsity as high as 99%. Unlike conventional sparsification techniques, SparseVLR does not require a pre-trained dense model, reducing the training time by about 50%. We demonstrate SparseVLR's efficacy and generalizability by evaluating various benchmark and application-specific datasets across several models.



1 Introduction

The application of neural networks (NNs) in safety-critical systems requires a framework for formal guarantees regarding their functioning [reluplex]. The brittle nature of NNs renders them vulnerable to imperceptible adversarial perturbations, which can lead to inaccurate model predictions [goodfellow2015explaining], and the actions triggered based on these incorrect predictions may have catastrophic effects [make2040031].

Empirical adversarial training of NN models relies on efficiently finding adversarial perturbations around individual samples leveraging existing adversarial attacks. However, such training cannot guarantee that no other (yet unknown) attack would find adversarial samples in the ℓp-ball of radius ε around the respective samples [crown].

For instance, Madry et al. [madry2018towards] used the PGD attack to generate worst-case perturbations to be used for adversarial training. Recent approaches [Wang2020Improving_pgd, pmlr-v119-zhang20z-pgd, local-linearization_pgd] have proposed several mechanisms to improve upon the baseline adversarial training mechanism [madry2018towards], but to generate adversarial samples for training, these approaches employ PGD attacks of varying strengths. However, Tjeng et al. [tjeng2018evaluating] showed that PGD does not necessarily generate the perturbations with maximum loss; thus, models trained to minimize the PGD adversarial loss were still vulnerable to other, stronger attacks. On the contrary, verified local robustness training [reluplex, ibp_training, crown-ibp, auto-lirpa] guarantees the non-existence of adversarial samples in the vicinity of the benign samples, making it essential for safety-critical systems.

Model Compression - A Necessity. The enormous number of parameters present in dense NNs challenges their deployment in resource-constrained environments such as self-navigation, hazard detection, etc., making it essential to generate sparser NN models. The majority of the existing works that focus on obtaining sparse NN models aim to keep the prediction (benign) accuracy of the sparse model comparable to dense models trained with the same objective [quantization, low-rank-factorization, han2015learning, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20]. Recent works add the target of adversarial robustness against a certain attack alongside benign accuracy [sehwag2019compact, sehwag2020hydra, Lee_2022_CVPR]. This first-of-its-kind work extends the scope to obtaining sparse NNs that exhibit high accuracy and verified local robustness (the definition is provided in Sec. 2.2) while using significantly fewer parameters (as few as 1% of the parameters of comparable verified locally robust dense networks evaluated in the literature).

Pruning is the most common sparsification mechanism used in the literature [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra]. Our preliminary study (Sec. 3), leveraging pruning mechanisms, establishes the existence of verified locally robust sparse networks as sub-networks of dense verified locally robust models. Though conventional pruning fails, pruning combined with Dynamic Sparse Training (DST) [nest] can achieve such sparse networks, comparable in generalizability to their denser counterparts. In this work, we refer to model generalizability as the ability of the model to exhibit high (benign) accuracy and verified local robustness simultaneously.

However, pruning requires a fully converged dense NN to start with, which is computationally expensive. The above-mentioned findings motivate this study to adapt DST to achieve locally robust sparse networks from scratch, without such a requirement. Notably, this paper's empirical evaluations show that obtaining a sparse network from scratch results in even higher generalizability and sparsity.

In summary, the contributions of the paper are:

  1. This first-of-its-kind work presents SparseVLR -- an approach to train a sparse locally robust model without training a dense over-parameterized model (Sec. 4). The sparse models exhibit similar generalizability while utilizing significantly fewer (as few as 1% of the) parameters than their denser counterparts evaluated in the literature.

  2. This paper performs a detailed empirical analysis to understand the high effectiveness of SparseVLR (Sec. 5.1). It identifies that DST, combined with using a randomly initialized sparse network as the starting network, results in high gradient flow during training, allowing the model to learn new concepts by exploring new network connections.

  3. Our empirical study (Sec. 5) demonstrates that SparseVLR is effective on benchmark image datasets (MNIST and CIFAR-10) and model architecture combinations used in the literature, as well as on application-relevant datasets: SVHN, Pedestrian Detection [pedestrian-dataset], and a sentiment analysis NLP dataset [maas-etal-2011-learning].

Additionally, Sec. 2 provides relevant background and a discussion of related work, Sec. 6 presents evaluations of some design choices of SparseVLR, and Sec. 7 discusses the impact, limitations, and future-work scope of this study. Notably, our comprehensive Appendix further discusses the datasets, hyper-parameters, seeds, models, background, and detailed analysis of different aspects of SparseVLR.

2 Background and Related-Works

2.1 Formal Robustness Verification

Formal robustness verification of non-linear NNs is an NP-complete problem (discussed in Appendix B). Recent Linear Relaxation based Perturbation Analysis (LiRPA) methods can verify non-linear NNs in polynomial time. These methods use forward (IBP [ibp_training]), backward (CROWN [crown], Fastened-CROWN [Lyu2020FastenedCT], α-CROWN [alpha-crown], β-CROWN [wang2021betacrown], DeepPoly [DeepPoly], BaB Attack [xu2021fast]), or hybrid (CROWN-IBP [crown-ibp]) bounding mechanisms to compute a linear relaxation of a model.

Specifically, for each sample x, the perturbation neighborhood is defined as an ℓp ball of radius ε:

B_ε(x) = { x′ : ‖x′ − x‖_p ≤ ε }.   (1)

LiRPA aims to compute linear approximations of the model f(·; θ) (θ represents the model parameters) and provide lower and upper bounds at the output layer, such that for any sample x′ ∈ B_ε(x) and each class j, the model output is guaranteed to follow:

lb_j(x, ε) ≤ f_j(x′; θ) ≤ ub_j(x, ε).   (2)

2.2 Verified Local Robustness.

A NN model f is said to be verified locally robust [fromherz2021fast] for a sample x if argmax_j f_j(x′; θ) = argmax_j f_j(x; θ) for all x′ ∈ B_ε(x), scilicet for all the samples x′ ∈ B_ε(x), the model is guaranteed to assign the same class as it assigns to x.

LiRPA approaches measure the local robustness in terms of a margin vector m(x, ε) = C · f(x; θ), where C is a specification matrix of size K × K and K is the number of all possible classes [ibp_training, crown, crown-ibp, auto-lirpa]. Zhang et al. [crown-ibp] defines C for each sample (x, y) as:

C_{i, j} = +1 if j = y and i ≠ y;  C_{i, j} = −1 if i = j and i ≠ y;  C_{i, j} = 0 otherwise.   (3)

The entries in matrix C depend on the true class y. In matrix C, the row corresponding to the true class contains 0 at all the indices. Every other row i contains +1 at the index corresponding to the true class, −1 at the index corresponding to the current class i, and 0 at all the other indices. Thus, the value of m_i(x, ε) is given by f_y(x; θ) − f_i(x; θ), which is the difference between the output value of the true class and that of class i.

m_lb(x, ε) represents the lower bound of the margin vector over B_ε(x). If all the values of m_lb(x, ε) are positive, i.e., m_lb(x, ε) > 0, the model is said to be locally robust for sample x. That implies the model will always assign the highest output value to the true class label if a perturbation less than or equal to ε is induced to the sample x. Thus, m_lb(x, ε) > 0 implies the guaranteed absence of any adversarial sample in the region of interest. Furthermore, to specify the region bounded by ε, we use the ℓ∞-norm because the ℓ∞-ball covers the largest region [L-inf].
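As a concrete illustration of Eq. 3, the snippet below builds the specification matrix C for one sample and computes the margin vector m = C · f(x; θ) from the logits. The three-class toy logits and function name are illustrative, not taken from the paper's code.

```python
import torch

def specification_matrix(y_true: int, num_classes: int) -> torch.Tensor:
    """Build the specification matrix C for one sample (Eq. 3).

    Row i encodes the margin f_y(x) - f_i(x): +1 at the true-class index,
    -1 at index i, 0 elsewhere. The row for the true class is all zeros,
    so its margin entry is trivially 0.
    """
    C = torch.zeros(num_classes, num_classes)
    for i in range(num_classes):
        if i == y_true:
            continue
        C[i, y_true] = 1.0
        C[i, i] = -1.0
    return C

# Margin vector m = C f(x): positive entries mean the true class
# scores higher than the corresponding class.
logits = torch.tensor([2.0, 0.5, -1.0])        # f(x) for a 3-class toy example
C = specification_matrix(y_true=0, num_classes=3)
margins = C @ logits                            # tensor([0.0, 1.5, 3.0])
print(margins)
```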

2.3 Model Training to Maximize Local Robustness

Since a model is considered verified locally robust for a sample x if all the values of m_lb(x, ε) are positive, the training schemes proposed by previous approaches [ibp_training, crown-ibp] aim to maximize the robustness of a model by maximizing the lower bound m_lb(x, ε). Xu et al. [auto-lirpa] defines the training objective as minimizing the maximum cross-entropy loss between the model output and the true label among all samples (eq. 6 in [auto-lirpa]). Thus,

L(θ, ε) = Σ_{(x, y) ∈ D}  max_{x′ ∈ B_ε(x)}  CE( f(x′; θ), y ),   (4)

where D is the dataset and CE is the cross-entropy loss. Intuitively, L is the sum of the maximum cross-entropy loss in the neighborhood of each sample in D for perturbation amount ε. Wong and Kolter [wongNKolter] showed that the problem of minimizing L (as defined in Eq. 4) and the problem of maximizing m_lb(x, ε) are duals of each other. Thus, a solution that minimizes L maximizes the values of m_lb(x, ε) and hence maximizes the local robustness for the sample x. Since the benign sample x itself lies in B_ε(x), minimizing L also maximizes benign accuracy, ergo maximizes generalizability altogether.

Xu et al. [auto-lirpa] computes m_lb(x, ε), which requires computing the bounding planes of Eq. 2 using one of the aforementioned LiRPA bounding techniques. Following Xu et al. [auto-lirpa], SparseVLR employs the CROWN-IBP hybrid approach as the bounding mechanism to optimize L in Eq. 4. The perturbation amount ε is initially set to zero and gradually increases to ε_max according to the perturbation scheduler (ε-scheduler) discussed in Sec. B.1.
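The paper optimizes Eq. 4 with CROWN-IBP bounds computed by LiRPA tooling. Purely as a self-contained illustration of the worst-case loss, the sketch below propagates the looser plain-IBP interval bounds through a tiny fully connected ReLU network and forms the verified cross-entropy loss from lower-bound margins; the network, ε value, and the choice of IBP (rather than CROWN-IBP) are simplifying assumptions, not the authors' setup.

```python
import torch
import torch.nn.functional as F

def ibp_bounds(layers, x, eps):
    """Propagate l_inf interval bounds through Linear/ReLU layers (plain IBP)."""
    lb, ub = x - eps, x + eps
    for layer in layers:
        if isinstance(layer, torch.nn.Linear):
            center, radius = (lb + ub) / 2, (ub - lb) / 2
            new_center = center @ layer.weight.t() + layer.bias
            new_radius = radius @ layer.weight.abs().t()
            lb, ub = new_center - new_radius, new_center + new_radius
        elif isinstance(layer, torch.nn.ReLU):
            lb, ub = lb.clamp(min=0), ub.clamp(min=0)   # ReLU is monotone
    return lb, ub

def verified_ce_loss(layers, x, y, eps):
    """Worst-case cross-entropy (spirit of Eq. 4) from IBP lower-bound margins.

    lb_y - ub_i is a valid (if loose) lower bound on f_y(x') - f_i(x');
    the pseudo-logit for class i is its negation, and the true class gets 0.
    """
    lb, ub = ibp_bounds(layers, x, eps)
    lb_true = lb.gather(1, y.unsqueeze(1))            # lower bound of f_y
    margin_lb = lb_true - ub                          # lower bound of f_y - f_i
    pseudo_logits = -margin_lb
    one_hot = F.one_hot(y, num_classes=lb.size(1)).bool()
    pseudo_logits = pseudo_logits.masked_fill(one_hot, 0.0)
    return F.cross_entropy(pseudo_logits, y)

# Toy usage: 2-layer ReLU net on random data.
net = [torch.nn.Linear(784, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)]
x, y = torch.rand(8, 784), torch.randint(0, 10, (8,))
loss = verified_ce_loss(net, x, y, eps=0.1)
loss.backward()
```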

Metrics for evaluations: Tab. 1 defines the metrics used to evaluate the models, which have been used in previous works on formal robustness verification [ibp_training, crown, crown-ibp, auto-lirpa].

Error | Definition
Standard | % of benign samples classified incorrectly.
Verified | % of benign samples for which at least one value in m_lb(x, ε) is negative.
PGD | % of perturbed samples, generated using a 200-step PGD attack, classified incorrectly.
Table 1: Evaluation metrics. Here, % implies percentage.

2.4 Model Sparsification

Model sparsification is the most common form of model compression, which constrains the model to use only a subset of model parameters for inference [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra]. Hoefler et al. [pruning-survey] categorizes model sparsification based on the stage at which sparsification is performed: train-and-sparsify, pruning a fully trained model, often followed by re-training (finetuning) the retained parameters [han2015deep, zhang2018systematic, sehwag2019compact, sehwag2020hydra]; sparsifying-during-training, multiple rounds of train-and-sparsify while increasing sparsity in every round, a.k.a. iterative pruning [itertaive-pruning, gate-decorator]; and sparse-training, training a sparse network using either a static [frankle2020linear, frankle2018the, Stable_lottery, savarese2020winning, pmlr-v119-malach20a] or dynamic [nest, itop] mask.

Recently developed dynamic-mask-based sparse-training approaches, known as Dynamic Sparse Training (DST) [nest, itop], iteratively follow a grow-and-prune paradigm to achieve an optimal sparse network. Starting from a random sparse seed network, new connections (i.e., parameters) that minimize the natural loss are added (grow step), and the least important connections are removed (prune step). DST's higher efficacy compared with static-mask-based sparse-training is attributed to the high gradient flow, which allows the model to learn an optimal sparse network for effective inference generation [DST_reason]. SparseVLR adapts the DST approach to identify and train verified locally robust sparse networks from scratch.

Parameter removal or pruning: SparseVLR uses global magnitude-based pruning, an unstructured pruning mechanism [pytorch-prune], as multiple works [global-pruning-see, global-pruning-1, global-pruning-2, global-pruning-3, global-pruning] suggest its higher efficacy in achieving more generalizable sparse networks. It also captures the importance of a layer as a whole [movement-pruning]. Additionally, magnitude-based pruning approaches [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra, eie] remove the least-magnitude parameters, suggesting that the importance of a parameter is directly proportional to the magnitude of its weight.
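The cited PyTorch pruning utilities [pytorch-prune] support exactly this kind of ℓ1 global unstructured pruning. A minimal sketch follows; the toy model and the 99% amount are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.utils.prune as prune

# Toy model; in SparseVLR the backbone would be a 4/7-layer CNN or a ResNet.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(16 * 30 * 30, 10),
)

# Collect all weight tensors and prune the globally smallest 99% by |w|.
parameters_to_prune = [
    (module, "weight")
    for module in model.modules()
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear))
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.99,   # fraction of weights zeroed across all layers combined
)

# Per-layer sparsity can differ: global pruning keeps the layers that matter.
for module, _ in parameters_to_prune:
    zeros = float((module.weight == 0).sum()) / module.weight.numel()
    print(type(module).__name__, f"sparsity = {zeros:.2%}")
```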

There is rich literature on sparsification mechanisms and parameter removal. A detailed summary is in Appendix C.

3 Motivation: Existence of Verified Locally Robust Sparse Network

Zhang et al. [crown-ibp] noted that the verified local robustness training mechanisms proposed by Wong and Kolter [wongNKolter] and Wong et al. [scalable_fv] induce an implicit regularization, thus penalizing the magnitude of the model parameters during training. CROWN-IBP [crown-ibp] training incurs less regularization and shows an increasing trend in the magnitude of the model parameters during training. According to our preliminary analysis discussed in Appendix D, the implicit regularization caused by CROWN-IBP does penalize the networks' parameters, making them smaller as compared to naturally trained networks and causing a high fraction of the parameters to be close to zero in magnitude. The removal of such less significant parameters has minimal impact on model generalizability, which indicates the existence of a sparse sub-network that exhibits verified local robustness comparable to its dense counterpart.

Figure 1: Accuracy and local robustness of the dense model (original model) vs. the pruned model at different pruning amounts, obtained using (i) global pruning (pruned model), (ii) global pruning followed by fine-tuning (finetuned model), and (iii) re-training using grow-and-prune (grow and prune), for a 4-layer CNN on (a) MNIST (ε = 0.4) and (b) CIFAR-10 (ε = 8/255).

Fig. 1 demonstrates the standard and verified errors for a 4-layer CNN trained to minimize L as defined in Eq. 4 for the benchmark datasets MNIST and CIFAR-10 at various compression amounts. The evaluations are shown for (i) global pruning without any finetuning or re-training, (ii) conventional finetuning of a globally pruned network using a static mask, and (iii) re-training of the pruned network using the DST-based grow-and-prune technique [nest].

Fig. 1 shows that as the compression ratio increases, weights having higher significance for model inference become the pruning target. Thus, the model generalization, in terms of natural accuracy and robustness, drops significantly, resulting in a trivial classifier (a classifier that always predicts the same class irrespective of the input) in extreme cases. Conventional finetuning (ii) increases the standard and verified errors even at small compression amounts. However, re-training using grow-and-prune (iii) maintains the generalizability of the sparse models at relatively high compression amounts. A comparison of (ii) and (iii) for the 7-layer CNN and ResNet is provided in Appendix E.

The effectiveness of grow-and-prune (iii) in re-training a pruned model can be attributed to the better gradient flow encountered during training as compared to conventional static-mask-based finetuning (ii), which is in agreement with [DST_reason]. A discussion of their gradient flow and loss evolution during training is provided in Appendix E.

Moreover, according to our preliminary analysis, locally robust sparse training leveraging static-mask-based sparsifying-during-training approaches such as iterative pruning [itertaive-pruning] results in inferior performance.

This section establishes that (1) verified locally robust sparse sub-networks exist within dense, verified robustly trained models; and (2) the grow-and-prune training paradigm can help identify such sparse networks.

4 Verified Locally Robust Sparse Model via Dynamic-Mask-based Sparse-training

Motivated by the above-discussed findings, this paper presents a dynamic-mask-based sparse-training mechanism to achieve a verified locally robust sparse network without training a dense over-parameterized model. SparseVLR is based on the grow-and-prune paradigm [nest], which allows network connections to expand under the constraints of an underlying backbone architecture. Each parameter of the backbone model can be either in an active or a dormant state, based on whether or not it belongs to the sparse network at a particular instant. The weight of a dormant parameter is set to zero. The percentage of parameters that are dormant is the network's current sparsity.
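Since sparsity is simply the fraction of dormant (exactly zero) parameters, a small helper along these lines (illustrative, not taken from the authors' code) can track it during training:

```python
import torch

def current_sparsity(model: torch.nn.Module) -> float:
    """Fraction of backbone parameters that are dormant (exactly zero)."""
    total, dormant = 0, 0
    for p in model.parameters():
        total += p.numel()
        dormant += int((p == 0).sum())
    return dormant / total
```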

4.1 Problem Definition:

For a given randomly initialized backbone architecture with N parameters, the objective is to train a verified locally robust sparse network that uses only a subset of the available parameters (of the backbone), i.e., a number of active parameters that is orders of magnitude smaller than N. The sparse model thus achieved should exhibit high accuracy and verified local robustness comparable to a verified locally robust dense network having the same backbone architecture.

Algorithm 1 SparseVLR

Require: f: backbone architecture; p: target sparsity; D: dataset; ε_max: maximum perturbation; E: number of training epochs; (s, l): start and length of the perturbation scheduler; seed: a seed value.

1: Randomly initialize the backbone architecture f using seed.
2: f_s ← select the highest-magnitude (100 − p)% parameters of f to form a sparse seed network, making the remaining parameters dormant.
3: for epoch e = 1, ..., E do
4:     ε ← ε-Scheduler(e, s, l, ε_max)
5:     for minibatches B ∈ D do
6:         f_aux ← update all parameters of f_s, including the dormant ones, by the gradient of L(B, ε)   ▷ the auxiliary model is denser than the target sparsity
7:     end for
8:     f_s ← retain the highest-magnitude (100 − p)% parameters of f_aux (threshold at the p-th percentile of |w|) to regain the target sparsity
9: end for

4.2 SparseVLR: The Approach

Algorithm 1 describes the SparseVLR procedure. It requires a backbone architecture f, a target sparsity p, the dataset D, and the maximum perturbation ε_max. It also requires hyperparameters that vary for different datasets, including the number of training epochs E and the inputs for the ε-scheduler, s and l (discussed in Appendix A).

SparseVLR starts with an untrained random sparse (i.e., seed) network, keeping p% of the backbone parameters dormant. It is an iterative procedure, and each iteration comprises two phases. Thickening aims to explore new parameters in the backbone to maximize verified local robustness, resulting in lower sparsity; Pruning removes the parameters that hold lesser importance for model inference, to reach the target sparsity at the end of each iteration.

Seed Network (lines 1-2 in Algorithm 1). Given a backbone architecture, a sparse seed network is computed by randomly initializing all the backbone parameters and then retaining only the parameters with the top (100 − p)% weight magnitudes. All the other parameters of the backbone are set to zero (made dormant).

Thickening (lines 4-7 in Algorithm 1). This phase aims to densify the sparse network by inspecting all possible parameters (within the backbone). The parameter weights, including the dormant ones, are updated according to their corresponding gradients. This is also referred to as the gradient-based growth phase in the literature [nest, DST_reason, itop]. However, in the literature, gradients are computed with the objective of maximizing benign accuracy, that is, minimizing the standard error; in contrast, SparseVLR aims to maximize the verified local robustness of the model by minimizing the training loss L as defined in Eq. 4 for each mini-batch B. The perturbation ε used for computing L is provided by the perturbation scheduler (ε-scheduler) discussed in Sec. B.1.

The updated network is called the auxiliary network. Since there is no restriction on the capacity of the auxiliary network, it can have more than (100 − p)% active parameters.

Pruning (line 8 in Algorithm 1). The auxiliary network obtained after the Thickening phase needs to be compressed back to the target sparsity p. This phase aims to reduce the model size in terms of the number of active parameters while having minimal impact on generalizability. As discussed in Sec. 2, to be consistent with the state-of-the-art, we employ ℓ1-unstructured global pruning as the model sparsification mechanism [paszke2017automatic] to attain the sparse network with the target sparsity. Our sparsification uses the ℓ1-norm of each individual model parameter as the metric to decide its importance towards model inference; notably, the ℓ1-norm of a single parameter corresponds to the magnitude of its weight.

However, Liu et al. [global-pruning-2] also showed that global pruning can result in a disconnected network. This can be problematic for approaches that use single-shot pruning, that is, pruning a fully trained network followed by finetuning only the retained parameters. SparseVLR dynamically explores and selects the connections that minimize the training loss, thus eliminating the chance of obtaining a disconnected network.
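To make Algorithm 1 concrete, here is a condensed PyTorch sketch of the seed/thicken/prune loop. The robust_loss callable (standing in for the CROWN-IBP loss of Eq. 4), the Adam optimizer, and the mask helpers are illustrative assumptions rather than the authors' implementation; for brevity, global magnitude selection here is applied to all parameters, including biases.

```python
import torch

def global_magnitude_mask(model, sparsity):
    """Keep only the top-(100-p)% weights by |w| across the whole model."""
    all_w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_w, sparsity)      # p-th percentile of |w|
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters()}

def apply_mask(model, mask):
    """Zero out dormant parameters so the network stays sparse."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(mask[name])

def sparse_vlr(model, loader, robust_loss, eps_scheduler,
               sparsity=0.99, epochs=10, lr=1e-3):
    # Seed network: random init followed by global magnitude pruning (lines 1-2).
    apply_mask(model, global_magnitude_mask(model, sparsity))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        eps = eps_scheduler(epoch)
        for x, y in loader:
            # Thickening: gradients update *all* parameters, dormant ones
            # included, producing a denser auxiliary network (lines 4-7).
            opt.zero_grad()
            robust_loss(model, x, y, eps).backward()
            opt.step()
        # Pruning: return to the target sparsity by global magnitude (line 8).
        apply_mask(model, global_magnitude_mask(model, sparsity))
    return model
```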

5 Experiments

This section empirically demonstrates: (1) the effectiveness of SparseVLR in achieving verified locally robust sparse models (Sec. 5.1); (2) that the dynamic masking used by SparseVLR results in high gradient flow (Sec. 5.1(i)-(ii)), which allows the model to learn new concepts (Sec. 5.1(iii)), aligning with the findings of [DST_reason]; (3) that inheriting the weight initialization from a fully converged dense model to obtain a sparse model results in ineffective training at high sparsity (Sec. 5.1(iv)); (4) the generalizability of SparseVLR on sequential networks (i.e., LSTM); and (5) an evaluation on practical, application-relevant datasets: SVHN (Street View House Numbers), Pedestrian Detection [pedestrian-dataset], and an NLP dataset for sentiment analysis [maas-etal-2011-learning].

Datasets and Models: To establish the empirical effectiveness of SparseVLR, we use the benchmark datasets MNIST and CIFAR-10, which are used in previous works [ibp_training, crown, crown-ibp, scalable_fv, wongNKolter, auto-lirpa]. To demonstrate the versatility of the approach across different architectures, we evaluate (a) two CNNs with varying capacity: a 4-layer CNN and a 7-layer CNN [ibp_training, crown-ibp]; (b) a network with skip connections: the 13-layer ResNet used by [scalable_fv]; and (c) a sequential network: the LSTM used by [auto-lirpa]. To demonstrate the effectiveness and generalizability of SparseVLR, evaluations are also shown for application-relevant datasets: SVHN, Pedestrian Detection [pedestrian-dataset], and Sentiment Analysis [maas-etal-2011-learning]. Further details about perturbations and hyperparameters are discussed in Appendix A.

Formal verification mechanisms to compute m_lb(x, ε) and the verified error: While training a sparse network, SparseVLR employs CROWN-IBP [crown-ibp]. In Sec. 6, we show empirically that the sparse models trained using CROWN-IBP exhibit lower errors than sparse models obtained using other bounding mechanisms such as IBP [ibp_training], CROWN [crown], and Fastened-CROWN [Lyu2020FastenedCT]. For evaluations, to be consistent with the state-of-the-art [ibp_training, crown-ibp, auto-lirpa], we use IBP to compute the verified error.

Model | Type | Error | MNIST 0% | 90% | 95% | 99% | CIFAR-10 0% | 90% | 95% | 99%
4-layer CNN | PM* | Standard | 5.29 | 4.79 | 4.78 | 6.97 | 59.88 | 59.32 | 60.73 | 90.0
4-layer CNN | PM* | PGD | 10.93 | 6.30 | 6.23 | 8.98 | 67.25 | 66.01 | 66.74 | 90.90
4-layer CNN | PM* | Verified | 18.07 | 17.22 | 17.46 | 21.35 | 71.5 | 71.04 | 72.12 | 90.0
4-layer CNN | RSM* | Standard | – | 5.71 | 6.59 | 8.09 | – | 60.78 | 61.26 | 64.14
4-layer CNN | RSM* | PGD | – | 7.62 | 9.00 | 10.63 | – | 67.08 | 67.13 | 69.09
4-layer CNN | RSM* | Verified | – | 19.33 | 20.84 | 23.14 | – | 71.90 | 72.45 | 73.43
7-layer CNN | PM* | Standard | 2.27 | 2.05 | 3.37 | 2.11 | 56.08 | 55.33 | 55.24 | 90.0
7-layer CNN | PM* | PGD | 6.19 | 3.93 | 4.85 | 3.61 | 68.65 | 67.59 | 67.21 | 90.0
7-layer CNN | PM* | Verified | 12.2 | 11.96 | 14.51 | 12.06 | 68.66 | 68.49 | 69.6 | 90.0
7-layer CNN | RSM* | Standard | – | 2.32 | 3.01 | 2.76 | – | 55.41 | 56.96 | 59.14
7-layer CNN | RSM* | PGD | – | 3.88 | 4.21 | 4.75 | – | 68.26 | 68.30 | 67.04
7-layer CNN | RSM* | Verified | – | 12.39 | 12.6 | 12.83 | – | 68.84 | 69.12 | 71.21
ResNet | PM* | Standard | 3.65 | 3.23 | 3.17 | 3.12 | 53.6 | 55.17 | 90.0 | 90.0
ResNet | PM* | PGD | 9.11 | 4.16 | 4.04 | 3.98 | 63.83 | 66.25 | 90.0 | 90.0
ResNet | PM* | Verified | 14.30 | 13.52 | 13.34 | 13.80 | 68.58 | 69.28 | 90.0 | 90.0
ResNet | RSM* | Standard | – | 3.91 | 3.88 | 4.95 | – | 56.30 | 57.99 | 60.31
ResNet | RSM* | PGD | – | 5.05 | 5.16 | 6.77 | – | 67.44 | 67.04 | 68.88
ResNet | RSM* | Verified | – | 15.06 | 15.39 | 17.74 | – | 69.33 | 70.14 | 71.38
Table 2: Error metrics for the 4-layer CNN, 7-layer CNN, and ResNet trained for MNIST (ε = 0.4) and CIFAR-10 (ε_train = 8.8/255, ε_test = 8/255) at different sparsity amounts. The 0% column corresponds to the dense model.

Figure 2: Variation of the training loss for the CIFAR-10 dataset on the 4-layer CNN at two sparsity amounts: (a) 95% and (b) 99%.

Nomenclature and abbreviations for the evaluated sparse networks: We aim to train an optimal sparse network that exhibits high accuracy and verified robustness starting from a random sparse network as the seed. However, conventional pruning approaches generally require a fully converged dense network as the starting point. To compare, this section evaluates thickening-and-pruning-based sparse model training (lines 3-9 of Algorithm 1) using two different starting/seed sparse network variations:


  • Random Sparse Model (RSM): This is the seed network used in SparseVLR in Sec. 4.2 and is obtained by applying global magnitude-based pruning to the randomly initialized backbone architecture.

  • Pruned Model (PM): PM is the same as the (i) model in Sec. 3, obtained by applying global magnitude-based pruning to a fully converged dense model.

The final models obtained after applying Algorithm 1 (lines 3-9) to PM and RSM are denoted by PM* and RSM*, respectively. PM* is similar to (iii) in Sec. 3.

5.1 Establishing the Effectiveness of SparseVLR:

Tab. 2 shows the results of the empirical analysis of SparseVLR for the benchmark datasets MNIST and CIFAR-10. The results are shown for three models, the 4-layer CNN, 7-layer CNN, and ResNet, at sparsity amounts of 90%, 95%, and 99%. The models are evaluated through the standard, PGD, and verified errors described in Sec. 2. The column corresponding to 0% sparsity reports the error metrics of dense models (having the same architecture as the backbone in Algorithm 1) trained using the approach of Xu et al. [auto-lirpa], which aligns well with the state-of-the-art [crown-ibp, scalable_fv]. The presented results are averaged over five random initializations (seeds) of Algorithm 1.

According to Tab. 2, RSM* achieves evaluation metrics similar to those of the dense (0% sparsity) models (within 3% verified error) across all datasets, models, and sparsity amounts. These results establish that (i) SparseVLR can train a verified locally robust sparse model without needing a dense over-parameterized network trained until convergence, and (ii) the sparse model thus obtained is comparable to a verified locally robust dense network with the same architecture as the backbone.

Also, RSM* outperforms PM* (the best approach from Sec. 3). Tab. 2 shows that across all model-dataset scenarios, the error metrics for RSM* are comparable to PM* at smaller sparsity, and at higher sparsity (e.g., 99%) RSM* exhibits much lower error than PM*, establishing the higher efficacy of SparseVLR (Algorithm 1) compared to conventional pruning mechanisms. Comparisons of PM, PM*, and RSM* are provided in Appendix E.

To investigate the reason for the better performance of SparseVLR, a detailed analysis for the 4-layer CNN trained for CIFAR-10 is provided below; similar patterns were observed for the other models as well.

(i) Why the RSM* vs. PM* disparity at higher sparsity?

Fig. 2(a) depicts the loss evolution for PM→PM* and RSM→RSM* training at 95% sparsity and reveals that the losses eventually stabilize at the same level, which in turn is governed by the maximum perturbation ε_max. SparseVLR uses ε-scheduling, which provides zero perturbation for the initial epochs and then gradually increases the perturbation value. This results in an initial drop in the losses, as seen in Fig. 2(a). As the perturbation amount increases, the loss increases as well.

Notably, at 99% sparsity (Fig. 2(b)), the stabilized loss for PM* is significantly greater than that of RSM*. Higher loss leads to higher error, hence the observation. The evolution of the training loss of PM* is discussed further in Appendix E. The reason for the higher stabilized loss of PM* at high sparsity is discussed below.

(ii) Reason for the disparity in the stabilized losses of RSM* and PM*: gradient flow during training

Evci et al. [DST_reason] claims that different initializations may lead to different gradient flow, which in turn leads to varied accuracy of the sparse model trained via DST. To explain the disparity in the stabilized losses of PM* and RSM* at 99% sparsity (Fig. 2(b)), we plot the observed gradient flows in Fig. 3(b). It shows that the norm of the gradients observed while training RSM→RSM* is greater than the gradient norm incurred while training PM→PM*, thus allowing the model to learn new concepts (further discussed below) and hence incurring a lower stabilized loss.

Figure 3: (a) Distribution of weights and (b) gradient flow encountered while training RSM, PM, and RIPM for a 4-layer CNN for the CIFAR-10 dataset at 99% sparsity.

(iii) How does poor gradient flow hinder the ability of the sparse network to change connections (i.e., learn new concepts) during training?

The effect of poor gradient flow is two-fold: (i) fewer parameters are explored (activated) during the thickening phase, and (ii) the newly activated parameters of the auxiliary model have relatively small magnitudes.

Figure 4: Changes in sparse model structure while training a 4-layer CNN for CIFAR-10 dataset at 99% sparsity using RSM, PM and RIPM as seed networks

The change in the structure of a sparse network obtained after a pruning step can be quantified in terms of the number of backbone parameters whose state changed from active to dormant and vice versa. Fig. 4 depicts the changes in the sparse structure during training. Notably, PM→PM* training incurs almost no change, whereas the sparse structure during RSM→RSM* training is quite dynamic. This can be attributed to the smaller magnitudes of the newly activated parameters, which become an easy target for removal during the pruning step, resulting in a similar sparse structure. This further hinders the sparse network's ability to learn new concepts, leading to a higher loss for PM* than for RSM*.

(iv) Cause of inefficient learning while training PM→PM*: is it the structure or the weight initialization?

PM inherits parameter weights from a pre-trained dense model, and the PM structure is learned through global magnitude-based pruning. This section investigates whether the lower performance of PM* at high sparsity (e.g., 99%) is due to PM's retained parameter weights or its learned network structure.

For this investigation, we evaluate the Randomly Initialized Pruned Model (RIPM) as the seed network, which randomly initializes PM's parameters while keeping the learned structure intact. Tab. 3 compares RIPM*, the sparse model obtained by training RIPM using Algorithm 1 (lines 3-9), with PM, PM*, and RSM*. It can be observed that RIPM* generalizes comparably to RSM*. Also, Fig. 4 suggests that while training RIPM→RIPM*, the model is able to learn new connections (high variation in the sparse network structure) in the first half of training.

However, Fig. 3(b) shows that the gradient flow while training RIPM→RIPM* is only marginally better than that observed while training PM→PM*. Then why is RIPM* better than PM*?

A comparison of the initial seed networks' weight distributions, shown in Fig. 3(a), demonstrates that the magnitude of the active weights in RIPM is lower than that of PM. As noted above, poor gradient flow results in low magnitudes of the newly activated parameters' weights after the thickening phase; since the weights of RIPM's initially active parameters are also low, the newly active parameters can be selected (and replace previously active ones) during the pruning step, leading to changes in the structure of the sparse network during training. Such changes allow the model to learn new concepts, leading to the higher performance of the obtained RIPM*.

Thus, the parameter weights inherited from a fully converged dense network are responsible for PM→PM*'s inefficient learning at high sparsity amounts. This finding further motivates training a verified locally robust sparse network from scratch rather than pruning an over-parameterized, pre-trained dense model.

Error | PM | PM* | RSM* | RIPM*
Standard | 90.0 | 90.0 | 65.01 | 65.81
PGD | 90.0 | 90.90 | 70.43 | 69.85
Verified | 90.0 | 90.0 | 73.38 | 74.97
Table 3: Comparison of error metrics for sparse models obtained using different seed networks for a 4-layer CNN at 99% sparsity.

5.2 Application to Sequential Model

To further investigate the application of SparseVLR to sequential networks, we evaluated a small LSTM [auto-lirpa] as the backbone architecture to obtain sparse models for MNIST. Tab. 4 shows the standard, PGD, and verified errors for the sparse models obtained using SparseVLR. It can be noted that up to a 95% sparsity amount, the sparse models generalize comparably to their dense counterparts. The higher error at 99% sparsity can be attributed to an insufficient number of parameters for the target task.

Error | 0% | 80% | 90% | 95% | 99%
Standard | 4.04 | 4.42 | 4.38 | 5.52 | 11.55
PGD | 9.35 | 11.60 | 10.10 | 13.84 | 17.22
Verified | 14.23 | 11.85 | 10.75 | 13.90 | 18.70
Table 4: Error metrics for sparse models obtained using SparseVLR with an LSTM trained for MNIST at various sparsity amounts.

5.3 Evaluations for Application-Relevant Datasets

Tab. 5 presents the results for two application-relevant image datasets: (a) SVHN and (b) a pedestrian detection dataset that aims at differentiating between people and people-like objects [pedestrian-dataset]. The backbone architecture used for these evaluations is a 7-layer CNN adapted to the respective datasets. For both SVHN and Pedestrian Detection, the error metrics of the models at 99% sparsity are comparable to those of their dense counterparts. The perturbation amounts used for SVHN and Pedestrian Detection are 8/255 and 2/255, respectively. Additionally, sparse models constructed for a sentiment analysis NLP dataset [maas-etal-2011-learning] on an LSTM model [word-formal, auto-lirpa] exhibit errors within 1% of their dense counterparts, showing SparseVLR's generalizability (detailed results are presented in Appendix F).

Error | SVHN 0% | 95% | 99% | Pedestrian Detection 0% | 95% | 99%
Standard | 62.55 | 62.77 | 65.87 | 28.40 | 26.22 | 28.40
PGD | 68.19 | 68.23 | 70.52 | 42.35 | 39.66 | 43.03
Verified | 74.87 | 75.21 | 76.76 | 63.53 | 67.39 | 63.02
Table 5: Error metrics for sparse models obtained using SparseVLR with a 7-layer CNN backbone architecture trained for (a) SVHN and (b) Pedestrian Detection at different sparsities.

6 Design Choices for SparseVLR

Pruning after every thickening step: Liu et al. [itop] suggests that allowing the model to grow for multiple epochs before pruning results in exploring all the parameters at least once. However, the implicit regularization caused by the verified robust training loss (discussed in Sec. 3) does not allow such exploration. Our empirical evaluations show that performing a pruning step after every thickening step results in the best sparse models.

Using CROWN-IBP as the bounding mechanism: Tab. 6 compares the errors exhibited by the 4-layer CNN at 99% sparsity when trained by employing different bounding mechanisms to compute m_lb(x, ε): IBP [ibp_training], CROWN [crown], Fastened-CROWN [Lyu2020FastenedCT], and CROWN-IBP [crown-ibp]. Empirically, CROWN-IBP results in the best models for both MNIST and CIFAR-10. Hence, SparseVLR employs CROWN-IBP to compute m_lb(x, ε).

Dataset | Error | IBP | CROWN | CROWN-Fast | CROWN-IBP
MNIST (ε = 0.4) | Standard | 7.44 | 52.3 | 13.53 | 7.23
MNIST (ε = 0.4) | PGD | 10.27 | 68.84 | 18.20 | 9.68
MNIST (ε = 0.4) | Verified | 21.54 | 93.13 | 42.18 | 21.95
CIFAR-10 (ε = 8/255) | Standard | 62.85 | 69.04 | 68.72 | 65.01
CIFAR-10 (ε = 8/255) | PGD | 78.21 | 71.53 | 81.69 | 70.43
CIFAR-10 (ε = 8/255) | Verified | 78.52 | 88.3 | 89.39 | 73.38
Table 6: Error metrics for different formal verification mechanisms used to compute m_lb(x, ε) on a 4-layer CNN trained for MNIST and CIFAR-10 at 99% sparsity.

7 Discussion on Impact and Limitations

Impact: SparseVLR will enable resource-constrained platforms to leverage verified robust models. Sparse training reduces the total training time to almost half that of conventional pruning. Moreover, the inference time of the sparse models is, on average, five times lower than that of their dense counterparts on an NVIDIA RTX A6000 (detailed results are presented in Appendix H).

Limitations and Future Work: SparseVLR uses CROWN-IBP; thus, the results obtained are restricted by the robustness achievable with CROWN-IBP. Therefore, SparseVLR will need to be tailored to further advancements in verified local robustness training mechanisms. Secondly, the empirical analysis has been done for models used in the literature [crown-ibp, scalable_fv], and we show that sparse models exist for such architectures; however, further analysis of more complex networks and domains would be beneficial, which is out of the scope of this paper. Thirdly, there is rich literature on pruning, but we only investigated global unstructured pruning mechanisms, since the focus was on training sparse models from scratch and a rigorous investigation of pruning was out of scope. Our preliminary results on applying a static-mask-based sparse-training paradigm based on the 'Lottery Ticket Hypothesis' [frankle2018the] to verified locally robust sparse NN search show inferior performance in this domain, suggesting the need for further investigation leveraging more sophisticated approaches for finding the lottery ticket. Finally, following most studies in the literature, this paper addresses up to 99% model sparsification. However, our analysis shows that at extreme sparsity (99.9%) generalizability decreases, indicating that the optimal sparsity lies in the range 99%-99.9%. A detailed discussion is in Appendix G. Though finding the optimal sparsity is out of scope, future work in this direction would be beneficial.

8 Conclusion

This is the first work investigating and showing that a verified locally robust sparse network can be trained from scratch, leveraging a dynamic-mask-based mechanism. The ability to train a sparse network from scratch removes the conventional requirement of training a dense network and reduces the training time to almost half. Additionally, the empirical analysis demonstrates SparseVLR's generalizability, thus enabling the deployment of verified locally robust models in resource-constrained safety-critical systems.


Appendix A Experiment Details

Datasets: The empirical analysis of SparseVLR is done for five datasets: MNIST, CIFAR-10, SVHN, Pedestrian Detection, and Sentiment Analysis.

MNIST is a dataset of hand-written digits with 60,000 samples in the training set and 10,000 samples in the testing set. All the images are greyscale and have a size of 28×28. The training set is generated using samples from approximately 250 writers. The writers for the test set and the training set are disjoint.

CIFAR-10 is a dataset of 60,000 images evenly distributed among 10 mutually exclusive classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The training set and the testing set consist of 50,000 and 10,000 images, respectively. These RGB images are of size 32×32 each. Additionally, the test set contains 1,000 randomly selected images from each class, thus having a uniform distribution over all the classes.

SVHN is a real-world dataset for digit recognition. The images are Street View House Numbers obtained from Google Street View imagery. The format used for the SparseVLR evaluations contains 32×32 images of individual digits with distractions. The dataset contains 73,257 images in the training set and 26,032 images for testing.

Pedestrian Detection [pedestrian-dataset] aims to differentiate between persons and person-like objects such as statues, scarecrows, etc., which have features very similar to a person. The numbers of images in the train, validation, and test sets are 944, 160, and 235, respectively, with a total of 1,626 person and 1,368 non-human labels.

SparseVLR is applicable to domains other than image classification. To illustrate this, an empirical study has been done on a sentiment analysis NLP dataset [maas-etal-2011-learning] used in previous formal verification approaches [auto-lirpa, word-formal]. The dataset contains 10,662 sentences evenly distributed between two classes: positive and negative.

Perturbation amount: For the image classification datasets, the perturbations are introduced in an ℓ∞ ball of radius ε (the perturbation amount), as defined in Sec. 2. The highest perturbation amounts (i.e., the most difficult scenarios) used in the literature addressing formal verification and training [ibp_training, crown-ibp, scalable_fv, auto-lirpa] are 0.4 and 8/255 for MNIST and CIFAR-10, respectively. For CIFAR-10, Zhang et al. [crown-ibp] uses different perturbation amounts for training (ε_train) and testing (ε_test), such that ε_train > ε_test; we follow the same convention. Thus, for the models that need to be tested at 8/255, the training is done with a perturbation amount of 8.8/255. For SVHN, previous works targeting adversarial robustness [sehwag2020hydra] use a perturbation amount of 8/255; thus, following the convention of CIFAR-10, we use ε_test = 8/255 and ε_train = 8.8/255. For the Pedestrian Detection dataset, we use ε_test = 2/255 following the same convention.

For the sentiment analysis NLP dataset, the perturbations are introduced in terms of synonym-based word substitution in a sentence. Each word has a set of synonyms with which it can be substituted; this set is computed using the 8 nearest neighbors in a counter-fitted word embedding space, where the distance signifies similarity in meaning [word-formal, auto-lirpa]. The number of words being substituted is referred to as the budget. The maximum budget used in the state-of-the-art [auto-lirpa, word-formal] for formal verification is 6, and we use the same amount.

Hyperparameters: The presented approach requires a set of hyperparameters as inputs, (E, s, l), where E is the total number of training epochs, s is the epoch at which the perturbation scheduler starts to increment the amount of perturbation, and l specifies the length of the schedule, that is, the number of epochs in which the perturbation scheduler has to reach the maximum perturbation ε_max.

These values are chosen as follows: (1) for the 4-layer CNN, 7-layer CNN, and ResNet trained for MNIST and CIFAR-10, and (2) for the LSTM trained for MNIST, the values of (E, s, l) comply with the state-of-the-art [auto-lirpa]; (3) for SVHN, the hyperparameters are the same as those of CIFAR-10; (4) for Pedestrian Detection, the hyperparameters are those that empirically resulted in models displaying the least errors; and (5) for the sentiment analysis NLP dataset, the hyperparameters also follow [auto-lirpa]. For Pedestrian Detection and SVHN, no prior hyperparameter values are available.

Learning rate: The training approach uses a learning rate scheduler that gradually decreases the learning rate as training proceeds after the maximum perturbation is reached. The starting learning rate is set as suggested by the state-of-the-art [auto-lirpa].

Appendix B Formal Verification

An NP-Complete problem: Linear Programming [Bastani2016MeasuringNN-lp] and Satisfiability Modulo Theory (SMT) [smt-formal] have been used in the literature as formal verification techniques to validate neural networks. However, Pulina and Tacchella [smt-challenge] showed that Multi-Layer Perceptrons (MLPs) cannot be verified using SMT solvers, thus challenging the scalability of SMT solvers to neural networks of realistic sizes. Furthermore, several applications require non-linear activation functions to learn complex decision boundaries, rendering the whole NN non-linear. The large size and non-linearity of NNs make the problem of formal robustness verification non-convex and NP-Complete [reluplex, sinha2018certifiable].

Relaxation-based Methods: To address the challenges mentioned above, Katz et al. [reluplex] used a relaxation of ReLU activations, temporarily violating the ReLU semantics, resulting in an efficient query solver for NNs with ReLU activations. The piece-wise linearity of the ReLU activations is the basis for this relaxation. LiRPA-based methods such as CROWN [crown], IBP [ibp_training], CROWN-IBP [crown-ibp], etc. compute linear relaxations of a model containing general activation functions that are not necessarily piece-wise linear.

CROWN [crown] uses a backward bounding mechanism to compute linear or quadratic relaxations of the activation functions. IBP [ibp_training] is a forward bounding mechanism that propagates the perturbations induced at the input towards the output layer in a forward pass based on interval arithmetic, resulting in ranges of values for each class in the output layer. CROWN-IBP [crown-ibp] is a hybrid approach: it demonstrated that IBP can produce loose bounds and, in order to achieve tighter bounds, it uses IBP [ibp_training] in the forward bounding pass and CROWN [crown] in the backward bounding pass to compute bounding hyperplanes for the model. CROWN-IBP has been shown to compute tighter relaxations, thus providing the best results for formal verification.

As discussed in Sec. 2, the linear relaxation computed using LiRPA methods can be used to train verified locally robust models by minimizing L. This learning procedure uses a perturbation scheduler to gradually increase the amount of perturbation during training.

B.1 Perturbation Scheduler

The ε-scheduler aims to provide a perturbation amount for every training epoch, which gradually increases starting at epoch s. The schedule starts with zero perturbation and reaches ε_max in l epochs [ibp_training]. Gowal et al. [ibp_training] proposed using such a scheduler for training based on LiRPA and deems it necessary for an effective learning procedure. Since IBP [ibp_training] uses interval propagation arithmetic to propagate the input perturbation to the output layer, and the interval size usually keeps increasing as the propagation reaches deeper into the model, the gradual increase of ε prevents the problem of intermediate bound explosion during training. Other related approaches have also adopted the idea of gradual perturbation increment [crown-ibp, auto-lirpa]. Since the bounding mechanism used in SparseVLR is CROWN-IBP [crown-ibp], the use of the ε-scheduler is crucial.
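A minimal linear ε-scheduler matching this description (zero perturbation until epoch s, then a ramp to ε_max over l epochs) is sketched below; the published schedulers [ibp_training, crown-ibp] use smoother ramps, so this is an illustrative simplification, and the example values are assumptions.

```python
def eps_scheduler(epoch: int, s: int, l: int, eps_max: float) -> float:
    """Perturbation for a given epoch: 0 before epoch s, then a linear
    ramp that reaches eps_max after l epochs, and eps_max afterwards."""
    if epoch < s:
        return 0.0
    return min(eps_max, eps_max * (epoch - s + 1) / l)

# Example: a conceptual MNIST-style schedule (eps_max = 0.4), assuming the
# ramp starts at epoch 10 and lasts 50 epochs (illustrative values only).
schedule = [eps_scheduler(e, s=10, l=50, eps_max=0.4) for e in range(70)]
```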

Appendix C Model Compression

Several model compression approaches have been proposed to reduce the computational and storage requirements of deep neural networks (DNNs). Quantization aims to convert the parameter values to low-precision approximations requiring fewer storage bits per parameter [10.5555/3122009.3242044, Gong2014CompressingDC, quantization, chmiel2021neural]. Model Distillation [KD_main] and Neural Architecture Search [NAS] train a smaller dense architecture that generalizes comparably to the original dense network. Low-Rank Factorization [low-rank-factorization] computes a matrix decomposition of the weight matrices, which requires fewer floating point operations. Model Sparsification/Pruning is the most common form of model compression, which constrains the model to use only a subset of the original parameters for inference [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra].

C.1 Model Sparsification

Hoefler et al. [pruning-survey] categorizes model sparsification according to the stage at which sparsification is performed: train-and-sparsify (after training), sparsifying-during-training (while training), and sparse-training (before training).

Train-and-sparsify approaches  [han2015deep, zhang2018systematic, sehwag2019compact, sehwag2020hydra] train a dense model and remove (or prune) the parameters contributing the least towards model inference. Since the removal of parameters may lead to loss of information learned by the original dense model, the removal of parameters is often followed by re-training the retained parameters, also known as the finetuning step. Train-and-sparsify associates a static binary mask with the parameters, which allows the finetuning step to update only the retained parameters.

The sparsifying-during-training mechanism gradually removes NN model parameters in small fractions per round. Each round is followed by a finetuning step allowing the model to recover from the parameter removal step  [itertaive-pruning, gate-decorator]. This is often called iterative pruning.

The sparse-training mechanism involves training a sparse network from scratch by using either a static or a dynamic mask. Identifying static masked sparse network is motivated by Lottery Ticket Hypothesis (LTH)  [frankle2018the] which states that “dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations”. However, finding the static mask in LTH  [frankle2020linear, frankle2018the, Stable_lottery, savarese2020winning, pmlr-v119-malach20a] still requires a fully or partially trained dense model to start with.

Recently, dynamic-mask-based sparse-training mechanisms known as Dynamic Sparse Training (DST) have been developed. DST approaches  [nest, itop] start with a randomly initialized compressed sparse network and explore new network connections while learning network weights for the target task during training. SparseVLR adapts a DST approach to identify verified locally robust sparse network from scratch.

C.2 Selecting the Parameters to be Removed

While all the sparsification approaches mentioned above perform the removal of network parameters (i.e., weights), the selection of such parameters is done based on different criteria, such as the magnitude of the weights, the magnitude of the corresponding gradients, minimal impact on the Hessian matrix, etc. [pruning-survey]. Magnitude-based pruning is the most popular pruning mechanism [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra, eie]; it removes the least-magnitude parameters, suggesting that the importance of a parameter is directly proportional to the magnitude of its weights.

Moreover, the removal of parameters can be done either per layer or globally, and recent studies [global-pruning-see, global-pruning-1, global-pruning-2, global-pruning-3, global-pruning] suggest the higher efficacy of global unstructured pruning. Sanh et al. [movement-pruning] demonstrated that global magnitude-based pruning tends to prune the earlier layers of a network by a lesser amount. Since the earlier layers of most networks are feature-extraction layers, which hold more importance for a classification task, it can be concluded that global pruning captures the importance of a layer as a whole.

Appendix D CROWN-IBP causes regularization

Figure 5: Weight distributions for (a) the 4-layer CNN (MNIST), (b) the 4-layer CNN (CIFAR-10), (c) the 7-layer CNN (MNIST), and (d) the 7-layer CNN (CIFAR-10). The ε amounts used for robust training are 0.4 and 8.8/255 for MNIST and CIFAR-10, respectively.

The motivation for SparseVLR is that a huge fraction of the parameters of locally robust models trained using CROWN-IBP have low magnitudes, so the removal of such less significant parameters has minimal effect on model generalizability (Sec. 3).

Fig. 5 shows the weight distributions of models trained to minimize the natural loss and the verified locally robust (CROWN-IBP [crown-ibp]) loss. The trends are shown for the 4-layer and 7-layer CNNs trained for the MNIST and CIFAR-10 datasets. As evident from the distribution of weights, a higher fraction of the weights in the verified locally robust models have very low magnitudes. This observation suggests that models trained to minimize L (Eq. 4) penalize the parameter magnitudes.

Appendix E Explaining the generalizability difference for various sparse training mechanisms

E.1 Conventional Finetuning vs Grow-and-Prune

Conventional finetuning aiming to minimize L ((ii) in Sec. 3) increases the standard and verified errors even at small compression amounts. Such an increase can be attributed to the poor gradient flow encountered during static-mask-based finetuning, which is in agreement with [DST_reason]. However, retraining a pruned model using grow-and-prune ((iii) in Sec. 3) does not suffer from poor gradient flow and hence may help maintain the generalizability of the sparse compressed models.

Figure 6: (a) Gradient flow and (b) loss while re-training a globally pruned 4-layer CNN for CIFAR-10 (ε = 8/255) using conventional fine-tuning and grow-and-prune.

To further investigate, Fig. 6 compares the losses and the norm of the gradients encountered while re-training a globally pruned network using conventional static-mask-based finetuning (keeping the mask the same throughout training) and the grow-and-prune mechanism [nest]. The results are shown for a 4-layer CNN trained for CIFAR-10 at 90% sparsity. It is important to note that the training procedure uses ε-scheduling, which provides zero perturbation for the initial epochs and then gradually increases the perturbation value. This results in an initial drop in the losses, as seen in Fig. 6(b). As the perturbation amount increases, the loss increases as well and finally stabilizes.

Notably, the losses encountered by finetuning are higher than those of retraining using grow-and-prune, thus resulting in higher errors in the obtained compressed models (shown in Fig. 6(b)). The difference in losses encountered using the two approaches can be attributed to the difference in gradient flows (see Fig. 6(a)). Re-training using grow-and-prune incurs higher gradient flow than conventional finetuning, and according to Evci et al. [DST_reason], higher gradient flow leads to efficient training.

Model | Variant | Error | MNIST 0% | 80% | 90% | 95% | 99% | CIFAR-10 0% | 80% | 90% | 95% | 99%
4-layer CNN | PM | Standard | – | 5.11 | 5.98 | 12.98 | 59.83 | – | 60.0 | 61.28 | 84.33 | 90.0
4-layer CNN | PM | PGD | – | 6.75 | 7.84 | 16.03 | 62.19 | – | 67.3 | 68.05 | 86.43 | 90.0
4-layer CNN | PM | Verified | – | 18.49 | 22.80 | 40.7 | 79.18 | – | 71.41 | 72.02 | 87.4 | 90.0
4-layer CNN | PM* | Standard | 5.29 | 4.78 | 4.79 | 4.78 | 6.97 | 59.88 | 58.81 | 59.32 | 60.73 | 90.0
4-layer CNN | PM* | PGD | 10.93 | 6.34 | 6.30 | 6.23 | 8.98 | 67.25 | 68.46 | 66.01 | 66.74 | 90.90
4-layer CNN | PM* | Verified | 18.07 | 17.71 | 17.22 | 17.46 | 21.35 | 71.5 | 70.86 | 71.04 | 72.12 | 90.0
4-layer CNN | RSM* | Standard | – | 5.67 | 5.71 | 6.59 | 8.09 | – | 60.83 | 60.78 | 61.26 | 64.14
4-layer CNN | RSM* | PGD | – | 7.56 | 7.62 | 9.00 | 10.63 | – | 68.45 | 67.08 | 67.13 | 69.09
4-layer CNN | RSM* | Verified | – | 19.22 | 19.33 | 20.84 | 23.14 | – | 72.15 | 71.90 | 72.45 | 73.43
7-layer CNN | PM | Standard | – | 2.31 | 2.32 | 2.32 | 8.06 | – | 56.69 | 70.78 | 84.05 | 90.0
7-layer CNN | PM | PGD | – | 5.71 | 9.72 | 99.93 | 81.91 | – | 68.75 | 82.63 | 84.88 | 90.0
7-layer CNN | PM | Verified | – | 99.01 | 84.77 | 100.0 | 100.0 | – | 68.83 | 79.54 | 86.19 | 90.0
7-layer CNN | PM* | Standard | 2.27 | 1.85 | 2.05 | 3.37 | 2.11 | 56.08 | 54.59 | 55.33 | 55.24 | 90.0
7-layer CNN | PM* | PGD | 6.19 | 3.52 | 3.93 | 4.85 | 3.61 | 68.65 | 68.46 | 67.59 | 67.21 | 90.0
7-layer CNN | PM* | Verified | 12.2 | 12.25 | 11.96 | 14.51 | 12.06 | 68.66 | 68.5 | 68.49 | 69.6 | 90.0
7-layer CNN | RSM* | Standard | – | 2.37 | 2.32 | 3.01 | 2.76 | – | 55.11 | 55.41 | 56.96 | 59.14
7-layer CNN | RSM* | PGD | – | 4.11 | 3.88 | 4.21 | 4.75 | – | 67.17 | 68.26 | 68.30 | 67.04
7-layer CNN | RSM* | Verified | – | 12.54 | 12.39 | 12.6 | 12.83 | – | 68.60 | 68.84 | 69.12 | 71.21
ResNet | PM | Standard | – | 3.68 | 6.51 | 4.09 | 62.06 | – | 55.68 | 88.06 | 90.0 | 90.0
ResNet | PM | PGD | – | 6.03 | 8.98 | 5.74 | 64.36 | – | 65.24 | 89.41 | 90.0 | 90.0
ResNet | PM | Verified | – | 16.62 | 20.97 | 19.34 | 82.02 | – | 70.41 | 89.44 | 90.0 | 90.0
ResNet | PM* | Standard | 3.65 | 3.24 | 3.23 | 3.17 | 3.12 | 53.6 | 53.38 | 55.17 | 90.0 | 90.0
ResNet | PM* | PGD | 9.11 | 4.22 | 4.16 | 4.04 | 3.98 | 63.83 | 64.37 | 66.25 | 90.0 | 90.0
ResNet | PM* | Verified | 14.30 | 13.45 | 13.52 | 13.34 | 13.80 | 68.58 | 68.57 | 69.28 | 90.0 | 90.0
ResNet | RSM* | Standard | – | 3.54 | 3.91 | 3.88 | 4.95 | – | 55.36 | 56.30 | 57.99 | 60.31
ResNet | RSM* | PGD | – | 4.53 | 5.05 | 5.16 | 6.77 | – | 66.55 | 67.44 | 67.04 | 68.88
ResNet | RSM* | Verified | – | 15.5 | 15.06 | 15.39 | 17.74 | – | 68.95 | 69.33 | 70.14 | 71.38
Table 7: Error metrics for the 4-layer CNN, 7-layer CNN, and ResNet trained for MNIST (ε = 0.4) and CIFAR-10 (ε_train = 8.8/255, ε_test = 8/255) at different sparsity amounts. PM: globally pruned model without retraining; PM*: PM retrained using Algorithm 1 (lines 3-9); RSM*: obtained using SparseVLR. The 0% column corresponds to the dense model.

E.2 Training PM vs RSM

As discussed in Sec. 5, the effectiveness of SparseVLR is established by comparing the sparse models obtained by applying Algorithm 1 (lines 3-9) to different seed networks: PM (Pruned Model) and RSM (Random Sparse Model). Tab. 7 compares the sparse models obtained using three mechanisms: PM (global pruning only), PM* (retraining PM using Algorithm 1, lines 3-9), and RSM* (obtained using SparseVLR). The results are shown for three backbone architectures (4-layer CNN, 7-layer CNN, and ResNet) at four sparsity amounts (80%, 90%, 95%, and 99%), trained for the two datasets MNIST and CIFAR-10. It can be noted that RSM* outperforms both PM and PM*, especially at high sparsity amounts such as 99%.

To investigate the difference in losses encountered while training PM→PM* and RSM→RSM* at 99% sparsity, Fig. 7 (same as Fig. 2) shows the loss evolution for both scenarios. It can be noted that PM→PM* incurs a higher loss than RSM→RSM*, thus resulting in higher errors.

Figure 7: Variation of the training loss for the CIFAR-10 dataset on the 4-layer CNN at two sparsity amounts: (a) 95% and (b) 99%.

Interestingly, at 99% sparsity, the loss evolution for PM→PM* training does not follow the expected pattern, i.e., an initial decrease followed by an increase (see Fig. 7). This can be attributed to the poor gradient flow while training PM→PM* at 99% sparsity (as shown in Fig. 3(b) in Sec. 5). The low magnitude of the gradients results in low magnitudes of the newly activated parameters, which eventually get removed during the pruning step. On the other hand, the already active parameters, which have significantly higher magnitudes, tend to become more polarized, thus resulting in over-fitting. However, with an initially high loss, the gradient flow increases enough to generate some newly activated parameters with sufficient magnitude to replace the already activated parameters, which results in some decrease and eventual stabilization of the loss.

Appendix F Application to NLP for sentiment analysis

As discussed in Appendix A, for the sentiment analysis dataset, the perturbations are introduced in terms of synonym-based word substitution in a sentence, and the number of words substituted is called the budget. The budget used in the empirical analysis is 6.

Notably, this training does not use ℓp-norm perturbations; to keep the perturbation amount continuous, the training procedure consists of an initial warm-up phase [auto-lirpa]. If the word at index i of a clean sentence and of a perturbed sentence are represented by w_i and w̃_i, and e(w) is the embedding of word w, the effective embedding used during training is a convex combination of the clean and substituted-word embeddings,

e_i = (1 − λ) · e(w_i) + λ · e(w̃_i),

where the value of λ is gradually increased from 0 to 1 during the training phase. The results thus produced for the original dense model (sparsity = 0%) are in compliance with the state-of-the-art [auto-lirpa]. Table 8 shows that the sparse models obtained at different compression amounts using the presented approach exhibit standard and verified errors comparable to the original dense model. The PGD attack is based on ℓp-norm perturbations, which are not applicable to this domain, so the results shown in Table 8 do not include the PGD error.
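A one-line sketch of this warm-up interpolation, as reconstructed above; the function name and the exact form of the combination are assumptions based on the description, not the authors' exact formula.

```python
import torch

def warmup_embedding(e_clean: torch.Tensor, e_perturbed: torch.Tensor,
                     lam: float) -> torch.Tensor:
    """Effective word embedding during warm-up: a convex combination of the
    clean and substituted-word embeddings, with lam ramped from 0 to 1
    over training (reconstruction of the description above)."""
    return (1.0 - lam) * e_clean + lam * e_perturbed
```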

Error | 0% | 80% | 90% | 95% | 99%
Standard | 20.92 | 20.26 | 22.24 | 21.11 | 21.53
Verified | 24.27 | 22.35 | 23.94 | 23.53 | 23.72
Table 8: Error metrics for sparse models obtained using our approach with an LSTM as the backbone architecture trained for the NLP dataset at different sparsity amounts.

Appendix G Extreme sparsity:

The results presented in Tab. 2 show that the 99% sparse models generalize comparably to the dense models, thus indicating the possibility of exploring further sparsity. However, at 99.9% sparsity, the 7-layer CNN exhibits an increase of approximately 7% in standard error and 10% in verified error for both MNIST and CIFAR-10. This suggests that an optimal sparsity exists in the range of 99%-99.9% at which the sparse model generalizes comparably to its dense counterpart.

Appendix H Computation time benefit

In addition to the reduction of non-zero (active) parameters of a model, we observe that the sparse models obtained using our approach incur much less computation time. We measure the computation time in terms of the average inference time per sample on an NVIDIA RTX A6000. The computation times shown in Table 9 are computed for the CIFAR-10 dataset for three models, the 4-layer CNN, 7-layer CNN, and ResNet, at different sparsity amounts. It can be noted that the inference time required by the sparse models obtained using SparseVLR is on average five times less than the inference time of the dense models.

Sparsity | 0% | 80% | 90% | 95% | 99%
4-layer CNN | | | | |
7-layer CNN | | | | |
ResNet | | | | |
Table 9: Inference time (in seconds) at different sparsity amounts.