1 Introduction
The application of neural networks (NNs) in safety-critical systems requires a framework for formal guarantees regarding their functioning [reluplex]. The brittle nature of NNs renders them vulnerable to imperceptible adversarial perturbations, which can lead to inaccurate model predictions [goodfellow2015explaining], and the actions triggered based on these incorrect predictions may have catastrophic effects [make2040031].
Empirical adversarial training of NN models relies on efficiently finding adversarial perturbations around individual samples by leveraging existing adversarial attacks. However, such training cannot guarantee that no other (yet unknown) attack would find adversarial samples in the $\epsilon$-ball around the respective samples [crown].
For instance, Madry et al. [madry2018towards] used the PGD attack to generate worst-case perturbations for adversarial training. Recent approaches [Wang2020Improving_pgd, pmlrv119zhang20zpgd, locallinearization_pgd] have proposed several mechanisms to improve upon the baseline adversarial training mechanism [madry2018towards], but to generate adversarial samples for training, these approaches employ PGD attacks of varying strengths. However, Tjeng et al. [tjeng2018evaluating] showed that PGD does not necessarily generate the perturbations with maximum loss; thus, the models trained to minimize PGD adversarial loss were still vulnerable to other, stronger attacks. On the contrary, verified local robustness training [reluplex, ibp_training, crownibp, autolirpa] guarantees the non-existence of adversarial samples in the vicinity of the benign samples, making it essential for safety-critical systems.
Model Compression: A Necessity. The enormous number of parameters present in dense NNs challenges their deployment in resource-constrained environments such as self-navigation, hazard detection, etc., making it essential to generate sparser NN models. The majority of the existing works that focus on obtaining sparse NN models aim to keep the prediction (benign) accuracy of the sparse model comparable to that of dense models trained with the same objective [quantization, lowrankfactorization, han2015learning, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20]. Recent works aim at the additional target of adversarial robustness against a certain attack alongside benign accuracy [sehwag2019compact, sehwag2020hydra, Lee_2022_CVPR]. This first-of-its-kind work extends the scope to obtaining sparse NNs which exhibit high accuracy and verified local robustness (definition provided in Sec. 2.2) while using significantly fewer parameters (as few as 1% of those of the comparable verified locally robust dense networks evaluated in the literature).
Pruning is the most common sparsification mechanism used in the literature [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra]. Our preliminary study (Sec. 3) leveraging pruning mechanisms establishes the existence of verified locally robust sparse networks as subnetworks of dense verified locally robust models. Though conventional pruning fails, pruning combined with Dynamic Sparse Training (DST) [nest] can achieve such sparse networks, comparable in generalizability to their denser counterparts. In this work, we refer to model generalizability as the ability of the model to exhibit high (benign) accuracy and verified local robustness simultaneously.
However, pruning requires a fully converged dense NN to start with, which is computationally expensive. The above-mentioned findings motivate this study to adapt DST to achieve locally robust sparse networks from scratch, without such a requirement. Notably, this paper's empirical evaluations show that obtaining a sparse network from scratch results in even higher generalizability and sparsity.
In summary, the contributions of the paper are:


This first-of-its-kind work presents SparseVLR, an approach to train a sparse locally robust model without training a dense overparameterized model (Sec. 4). The sparse models exhibit similar generalizability but utilize significantly fewer parameters (as few as 1%) than their denser counterparts evaluated in the literature.

This paper performs a detailed empirical analysis to understand the high effectiveness of SparseVLR (Sec. 5.1). It identifies that DST, combined with a randomly initialized sparse network as the starting network, results in high gradient flow during training, allowing the model to learn new concepts by exploring new network connections.

Our empirical study (Sec. 5) demonstrates that SparseVLR is effective on benchmark image datasets (MNIST and CIFAR-10) and the model architecture combinations used in the literature, as well as on application-relevant datasets: SVHN, Pedestrian Detection [pedestriandataset], and a sentiment analysis NLP dataset [maasetal2011learning].
Additionally, Sec. 2 provides the relevant background and a related-work discussion, Sec. 6 presents evaluations of some design choices of SparseVLR, and Sec. 7 discusses the impact, limitations, and future-work scope of this study. Notably, our comprehensive Appendix further discusses the datasets, hyperparameters, seeds, models, background, and detailed analysis of different aspects of SparseVLR.
2 Background and Related Works
2.1 Formal Robustness Verification
Formal robustness verification of non-linear NNs is an NP-complete problem (discussed in Appendix B). Recent Linear Relaxation based Perturbation Analysis (LiRPA) methods can verify non-linear NNs in polynomial time. These methods use Forward (IBP [ibp_training]), Backward (CROWN [crown], Fastened-CROWN [Lyu2020FastenedCT], α-CROWN [alphacrown], β-CROWN [wang2021betacrown], DeepPoly [DeepPoly], BaB Attack [xu2021fast]), or Hybrid (CROWN-IBP [crownibp]) bounding mechanisms to compute a linear relaxation of a model.
Specifically, for each sample $x$, the perturbation neighborhood is defined as the $\ell_\infty$ ball of radius $\epsilon$:

$\mathbb{B}(x, \epsilon) = \{x' : \|x' - x\|_\infty \leq \epsilon\}$    (1)
LiRPA aims to compute linear approximations of the model $f_\theta$ (where $\theta$ represents the model parameters) and provide lower and upper bounds $\underline{f}_\theta$ and $\overline{f}_\theta$ at the output layer, such that for any sample $x' \in \mathbb{B}(x, \epsilon)$ and each class $j$, the model output is guaranteed to follow:

$\underline{f}_\theta^{\,j}(x, \epsilon) \leq f_\theta^{\,j}(x') \leq \overline{f}_\theta^{\,j}(x, \epsilon)$    (2)
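As an illustration of the forward (IBP) style of bounding, the sketch below propagates elementwise input bounds through a single affine layer in NumPy. The function name `ibp_affine` is our own; a real implementation (e.g., the auto_LiRPA library) also handles activations and multi-layer composition.

```python
import numpy as np

def ibp_affine(W, b, lo, hi):
    """Propagate elementwise input bounds [lo, hi] through y = W x + b.

    Positive weights map the lower input bound to the lower output bound;
    negative weights swap the roles of lo and hi (interval arithmetic).
    """
    W_pos = np.maximum(W, 0.0)  # positive part of W
    W_neg = np.minimum(W, 0.0)  # negative part of W
    y_lo = W_pos @ lo + W_neg @ hi + b
    y_hi = W_pos @ hi + W_neg @ lo + b
    return y_lo, y_hi
```

Any point inside the input box is then guaranteed to produce outputs within `[y_lo, y_hi]`, which is the interval analogue of Eq. 2.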
2.2 Verified Local Robustness
A NN model $f_\theta$ is said to be verified locally robust [fromherz2021fast] for a sample $x$ if $\arg\max_j f_\theta^{\,j}(x') = \arg\max_j f_\theta^{\,j}(x)$ for all $x' \in \mathbb{B}(x, \epsilon)$; i.e., for all the samples $x' \in \mathbb{B}(x, \epsilon)$, the model is guaranteed to assign the same class as it assigns to $x$. LiRPA approaches measure the local robustness in terms of a margin vector $m(x) = C f_\theta(x)$, where $C$ is a specification matrix of size $K \times K$ and $K$ is the number of all possible classes [ibp_training, crown, crownibp, autolirpa]. Zhang et al. [crownibp] define $C$ for each sample with true class $y$ as:

$C_{ij} = \begin{cases} 1 & \text{if } j = y,\; i \neq y \\ -1 & \text{if } i = j,\; i \neq y \\ 0 & \text{otherwise} \end{cases}$    (3)
The entries in matrix $C$ depend on the true class $y$. In $C$, the row corresponding to the true class contains $0$ at all the indices. Every other row $i$ contains $1$ at the index corresponding to the true class, $-1$ at the index corresponding to the current class $i$, and $0$ at all the other indices. Thus, the value of $m_i$ is given by $f_\theta^{\,y}(x) - f_\theta^{\,i}(x)$, which is the difference between the output value of the true class and that of each other class.
$\underline{m}$ represents the lower bound of the margin vector. If all the values of $\underline{m}$ are positive, i.e., $\underline{m} > 0$, the model is said to be locally robust for sample $x$. That implies the model will always assign the highest output value to the true class label if a perturbation of magnitude at most $\epsilon$ is induced to the sample $x$. Thus, $\underline{m} > 0$ implies the guaranteed absence of any adversarial sample in the region of interest. Furthermore, to specify the region bounded by $\epsilon$, we use the $\ell_\infty$ norm because it covers the largest region [Linf].
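The specification-matrix construction described above can be sketched as follows; `spec_matrix` is an illustrative helper we introduce, not the authors' code.

```python
import numpy as np

def spec_matrix(y, num_classes):
    """Build the K x K specification matrix C for true class y.

    Row y is all zeros; every other row i has +1 at column y and -1 at
    column i, so (C @ logits)[i] = logits[y] - logits[i].
    """
    C = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        if i != y:
            C[i, y] = 1.0   # +1 at the true-class index
            C[i, i] = -1.0  # -1 at the current-class index
    return C
```

Multiplying `C` by the logits yields the margin vector: the model is locally robust for the sample when the lower bound of every nonzero entry is positive.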
2.3 Model Training to Maximize Local Robustness
Since a model is considered verified locally robust for a sample if all the values of $\underline{m}$ are positive, the training schemes proposed by previous approaches [ibp_training, crownibp] aim to maximize the robustness of a model by maximizing the lower bound $\underline{m}$. Xu et al. [autolirpa] define the training objective as minimizing the maximum cross-entropy loss between the model output and the true label among all samples (Eq. 6 in [autolirpa]). Thus,

$\mathcal{L} = \sum_{(x, y) \in \mathcal{X}} \max_{x' \in \mathbb{B}(x, \epsilon)} \mathrm{CE}(f_\theta(x'), y)$    (4)
where $\mathcal{X}$ is the dataset and $\mathrm{CE}$ is the cross-entropy loss. Intuitively, $\mathcal{L}$ is the sum of the maximum cross-entropy loss in the neighborhood of each sample in $\mathcal{X}$ for perturbation amount $\epsilon$. Wong and Kolter [wongNKolter] showed that the problem of minimizing $\mathcal{L}$ (as defined in Eq. 4) and the problem of maximizing $\underline{m}$ are duals of each other. Thus, a solution that minimizes $\mathcal{L}$ maximizes the values of $\underline{m}$ and hence maximizes the local robustness for each sample. Since $x \in \mathbb{B}(x, \epsilon)$, minimizing $\mathcal{L}$ also maximizes benign accuracy, ergo maximizes generalizability altogether.
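A common concrete upper bound for the inner maximization in Eq. 4 evaluates the cross-entropy on worst-case logits: the true class at its lower bound and every other class at its upper bound. The sketch below assumes such output bounds are already available from a bounding mechanism; `worst_case_ce` is a hypothetical name, not the paper's implementation.

```python
import numpy as np

def worst_case_ce(lower, upper, y):
    """Upper bound on the worst-case cross-entropy over the input region.

    lower/upper: per-class output bounds over the perturbation neighborhood.
    Pessimistic logits: true class y at its lower bound, others at their upper.
    """
    z = upper.copy()
    z[y] = lower[y]
    z = z - z.max()                     # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()     # softmax over worst-case logits
    return -np.log(p[y])                # cross-entropy w.r.t. true class y
```

Minimizing this surrogate pushes the margin lower bounds upward, which is the dual view described above.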
Xu et al. [autolirpa] compute $\mathcal{L}$, which requires computing the bounding planes $\underline{f}_\theta$ and $\overline{f}_\theta$ using one of the aforementioned LiRPA bounding techniques. Following Xu et al. [autolirpa], SparseVLR employs the CROWN-IBP hybrid approach as the bounding mechanism to optimize $\mathcal{L}$ in Eq. 4. The perturbation amount is initially set to zero and gradually increased to $\epsilon$ according to the perturbation scheduler discussed in Sec. B.1.
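A minimal sketch of such a perturbation scheduler (zero perturbation during a warm-up period, then a linear ramp to the target) might look as follows; the exact schedule shape and the hyperparameter names here are assumptions, not the paper's implementation.

```python
def eps_schedule(epoch, eps_max, warmup_epochs, ramp_epochs):
    """Return the perturbation budget for a given epoch.

    Zero during warm-up, then a linear ramp from 0 to eps_max over
    ramp_epochs epochs, clamped at eps_max afterwards.
    """
    if epoch < warmup_epochs:
        return 0.0
    return min(eps_max, eps_max * (epoch - warmup_epochs) / ramp_epochs)
```

Such ramping lets the model first fit the clean data before the harder robust objective dominates the loss.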
Metrics for evaluations: Tab. 1 defines the metrics used to evaluate models, which have been used in the previous works on formal robustness verification [ibp_training, crown, crownibp, autolirpa].

Error  Definition
Standard  % of benign samples classified incorrectly.
Verified  % of benign samples for which at least one value in $\underline{m}$ is negative.
PGD  % of perturbed samples, generated using a 200-step PGD attack, classified incorrectly.
2.4 Model Sparsification
Model sparsification is the most common form of model compression; it constrains the model to use only a subset of the model parameters for inference [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra]. Hoefler et al. [pruningsurvey] categorize model sparsification based on the stage at which sparsification is performed: train-and-sparsify: pruning a fully trained model, often followed by retraining (fine-tuning) the retained parameters [han2015deep, zhang2018systematic, sehwag2019compact, sehwag2020hydra]; sparsifying-during-training: multiple rounds of train-and-sparsify while increasing sparsity in every round, a.k.a. iterative pruning [itertaivepruning, gatedecorator]; and sparse-training: training a sparse network using either a static [frankle2020linear, frankle2018the, Stable_lottery, savarese2020winning, pmlrv119malach20a] or dynamic [nest, itop] mask.
The recently developed dynamic-mask-based sparse-training approaches, known as Dynamic Sparse Training (DST) [nest, itop], iteratively follow a grow-and-prune paradigm to achieve an optimal sparse network. Starting from a random sparse seed network, new connections (i.e., parameters) which minimize the natural loss are added (grow step), and the least important connections are removed (prune step). DST's higher efficacy compared to static-mask-based sparse-training is attributed to high gradient flow, allowing the model to learn an optimal sparse network for effective inference generation [DST_reason]. SparseVLR adapts the DST approach to identify and train verified locally robust sparse networks from scratch.
Parameter removal or pruning: SparseVLR uses global magnitude-based pruning, an unstructured pruning mechanism [pytorchprune], as multiple works [globalpruningsee, globalpruning1, globalpruning2, globalpruning3, globalpruning] suggest its higher efficacy in achieving more generalizable sparse networks. It also captures the importance of a layer as a whole [movementpruning]. Additionally, magnitude-based pruning approaches [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra, eie] remove the least-magnitude parameters, suggesting that the importance of a parameter is directly proportional to the magnitude of its weight.
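Global magnitude-based unstructured pruning can be sketched as below: all parameters are ranked by absolute weight across layers, and only the top fraction survives. The helper name and the tie-breaking behavior are our assumptions; frameworks such as PyTorch ship equivalent utilities (e.g., `torch.nn.utils.prune.global_unstructured`).

```python
import numpy as np

def global_magnitude_masks(weights, sparsity):
    """Return boolean keep-masks removing the smallest-|w| fraction globally.

    weights: list of per-layer weight arrays; sparsity: fraction in [0, 1)
    of all parameters (pooled across layers) to remove.
    """
    scores = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(round(sparsity * scores.size))  # number of parameters to drop
    if k == 0:
        return [np.ones_like(w, dtype=bool) for w in weights]
    threshold = np.sort(scores)[k - 1]      # k-th smallest magnitude
    return [np.abs(w) > threshold for w in weights]
```

Because the threshold is shared across layers, a layer whose weights are uniformly small can lose most of its parameters, which is how global pruning reflects layer-level importance.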
There is rich literature on sparsification mechanisms and parameter removal. A detailed summary is in Appendix C.
3 Motivation: Existence of Verified Locally Robust Sparse Network
Zhang et al. [crownibp] noted that the verified local robustness training mechanisms proposed by Wong and Kolter [wongNKolter] and Wong et al. [scalable_fv] induce an implicit regularization, thus penalizing the magnitude of model parameters during training. CROWN-IBP [crownibp] training incurs less regularization and shows an increasing trend in the magnitude of model parameters while training. According to our preliminary analysis discussed in Appendix D, the implicit regularization caused by CROWN-IBP does penalize the networks' parameters, making them smaller as compared to naturally trained networks and causing a high fraction of the parameters to be close to zero in magnitude. Removal of such less significant parameters has minimal impact on model generalizability, which indicates the existence of a sparse subnetwork that exhibits verified local robustness comparable to its dense counterpart.
Fig. 1 demonstrates the standard and verified errors for a 4-layer CNN trained to minimize $\mathcal{L}$ (defined in Eq. 4) on the benchmark datasets MNIST and CIFAR-10 at various compression amounts. The evaluations are shown for (i) global pruning without any fine-tuning or retraining, (ii) conventional fine-tuning of a globally pruned network using a static mask, and (iii) retraining the pruned network using the DST-based grow-and-prune technique [nest].
Fig. 1 shows that as the compression ratio increases, weights having higher significance for model inference become the pruning target. Thus, the model generalization, in terms of natural accuracy and robustness, drops significantly, resulting in a trivial classifier (one that always predicts the same class irrespective of the input) in extreme cases. Conventional fine-tuning increases the standard and verified errors even at small compression amounts. However, retraining using grow-and-prune maintains the generalizability of the sparse models at relatively high compression amounts. A comparison of conventional fine-tuning and grow-and-prune retraining for a 7-layer CNN and a ResNet is provided in Appendix E.
The effectiveness of grow-and-prune retraining of a pruned model can be attributed to the better gradient flow encountered during training as compared to conventional static-mask-based fine-tuning, which is in agreement with [DST_reason]. A discussion of their gradient flow and loss evolution during training is provided in Appendix E.
Moreover, according to our preliminary analysis, locally robust sparse training leveraging static-mask-based sparsifying-during-training approaches, such as iterative pruning [itertaivepruning], results in inferior performance.
This section establishes that (1) verified locally robust sparse subnetworks exist within dense, verified robustly trained models; and (2) the grow-and-prune training paradigm can help identify such sparse networks.
4 Verified Locally Robust Sparse Model via Dynamic-Mask-based Sparse-training
Motivated by the above-discussed findings, this paper presents a dynamic-mask-based sparse-training mechanism to achieve a verified locally robust sparse network without training a dense overparameterized model. SparseVLR is based on the grow-and-prune paradigm [nest], which allows network connections to expand under the constraints of an underlying backbone architecture. Each parameter of the backbone model can be in either an active or a dormant state based on whether or not it belongs to the sparse network at a particular instant. The weight of a dormant parameter is set to zero. The percentage of parameters that are dormant is the network's current sparsity.
4.1 Problem Definition:
For a given randomly initialized backbone architecture, the objective is to train a verified locally robust sparse network which uses only a subset of the available parameters (of the backbone) that is an order of magnitude (or more) smaller than the backbone's full parameter count. The sparse model thus achieved should exhibit high accuracy and verified local robustness, comparable to a verified locally robust dense network having the same backbone architecture.
4.2 SparseVLR: The Approach
Algorithm 1 describes the SparseVLR procedure. It requires a backbone architecture, a target sparsity, the dataset, and the maximum perturbation $\epsilon$. It also requires hyperparameters that vary across datasets, including the number of training epochs and the inputs for the perturbation scheduler (discussed in Appendix A). SparseVLR starts with an untrained random sparse (i.e., seed) network, keeping the remaining parameters of the backbone dormant. It is an iterative procedure, and each iteration comprises two phases: Thickening, which explores new parameters in the backbone to maximize verified local robustness, resulting in lower sparsity; and Pruning, which removes parameters that hold lesser importance for model inference, reaching the target sparsity at the end of each iteration.
Seed Network (lines 1-2 in Algorithm 1): Given a backbone architecture, a sparse seed network is computed by randomly initializing all the backbone parameters and then retaining only the parameters with the top weight magnitudes, up to the target density. All the other parameters of the backbone are set to zero (made dormant).
Thickening (lines 4-7 in Algorithm 1): This phase aims to densify the sparse network by inspecting all possible parameters (within the backbone). The parameter weights, including the dormant ones, are updated according to their corresponding gradients. This is also referred to as the gradient-based growth phase in the literature [nest, DST_reason, itop]. However, in the literature, gradients are computed with the objective of maximizing benign accuracy, that is, minimizing the standard error; in contrast, SparseVLR aims to maximize the verified local robustness of the model by minimizing the training loss defined in Eq. 4 for each minibatch. The perturbation used for computing the loss is obtained from the perturbation scheduler discussed in Sec. B.1.
The updated network is called the auxiliary network. Since there is no restriction on the capacity of the auxiliary network, it can have more active parameters than the target sparse network size.
Pruning (line 8 in Algorithm 1): The auxiliary network obtained after the thickening phase needs to be compressed to the target size. This phase aims to reduce the model size, in terms of the number of active parameters, with minimum impact on generalizability. As discussed in Sec. 2, to be consistent with the state-of-the-art, we employ unstructured global pruning as the model sparsification mechanism [paszke2017automatic] to attain the sparse network with the target size. Our sparsification uses the $\ell_1$ norm of an individual model parameter as the metric to decide its importance for model inference. Notably, for a single parameter, the $\ell_1$ norm corresponds to the magnitude of its weight.
However, Liu et al. [globalpruning2] also showed that global pruning can result in a disconnected network. This can be problematic for approaches that use single-shot pruning, that is, pruning a fully trained network followed by fine-tuning only the retained parameters. SparseVLR dynamically explores and selects the connections that minimize the training loss, thus eliminating the chance of obtaining a disconnected network.
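Putting the two phases together, one SparseVLR-style iteration (thickening followed by global magnitude pruning) can be sketched as below. This is a simplified NumPy illustration under our own naming, not the paper's Algorithm 1, which computes gradients of the robust loss via CROWN-IBP.

```python
import numpy as np

def sparse_vlr_step(weights, grads, lr, sparsity):
    """One thickening + pruning iteration (illustrative sketch).

    weights: per-layer arrays with dormant parameters stored as 0.0;
    grads: gradients of the robust training loss w.r.t. ALL parameters.
    """
    # Thickening: gradient step on every backbone parameter, dormant ones
    # included, so new connections can activate (the auxiliary network).
    aux = [w - lr * g for w, g in zip(weights, grads)]
    # Pruning: global magnitude pruning of the auxiliary network back to
    # the target sparsity; pruned parameters are zeroed (made dormant).
    flat = np.concatenate([np.abs(a).ravel() for a in aux])
    k = int(round(sparsity * flat.size))
    threshold = np.sort(flat)[k - 1] if k > 0 else -np.inf
    return [a * (np.abs(a) > threshold) for a in aux]
```

Because the gradient step touches dormant parameters too, a dormant connection with a large robust-loss gradient can displace a weakly weighted active one at the next pruning step, which is the grow-and-prune dynamic the section describes.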
5 Experiments
This section empirically demonstrates: (1) the effectiveness of SparseVLR in achieving verified locally robust sparse models (Sec. 5.1); (2) that the dynamic masking used by SparseVLR results in high gradient flow (Sec. 5.1(i)-(ii)), which allows the model to learn new concepts (Sec. 5.1(iii)), aligning with the findings of [DST_reason]; (3) that inheriting the weight initialization from a fully converged dense model to obtain a sparse model results in ineffective training at high sparsity (Sec. 5.1(iv)); (4) the generalizability of SparseVLR, demonstrated by evaluating sequential networks (i.e., an LSTM); and (5) an evaluation on practical, application-relevant datasets: SVHN (Street View House Numbers), Pedestrian Detection [pedestriandataset], and an NLP dataset for sentiment analysis [maasetal2011learning].
Datasets and Models: To establish the empirical effectiveness of SparseVLR, we use the benchmark datasets MNIST and CIFAR-10, which are used in the previous works [ibp_training, crown, crownibp, scalable_fv, wongNKolter, autolirpa]. To demonstrate the versatility of the approach across different architectures, we evaluate (a) two CNNs with varying capacity: a 4-layer CNN and a 7-layer CNN [ibp_training, crownibp]; (b) a network with skip connections: a 13-layer ResNet used by [scalable_fv]; and (c) a sequential network: an LSTM used by [autolirpa]. To demonstrate the effectiveness and generalizability of SparseVLR, evaluations are shown for application-relevant datasets: SVHN, Pedestrian Detection [pedestriandataset], and Sentiment Analysis [maasetal2011learning]. Further details about perturbations and hyperparameters are discussed in Appendix A.
Formal verification mechanisms to compute the training loss and verified error: While training a sparse network, SparseVLR employs CROWN-IBP [crownibp]. In Sec. 6, we show empirically that the sparse models trained using CROWN-IBP exhibit lower errors than sparse models obtained using other bounding mechanisms such as IBP [ibp_training], CROWN [crown], and Fastened-CROWN [Lyu2020FastenedCT]. To be consistent with the state-of-the-art [ibp_training, crownibp, autolirpa], the presented evaluations use IBP to compute the verified error.
Nomenclature & abbreviations for the evaluated sparse networks: We aim to train an optimal sparse network that exhibits high accuracy and verified robustness, starting from a random sparse network as the seed. However, conventional pruning approaches generally require a fully converged dense network as the starting point. To compare, this section evaluates thickening-and-pruning-based sparse model training (Algorithm 1) using two different starting/seed sparse network variations:


Random Sparse Model (RSM): This is the seed network used in SparseVLR in Sec. 4.2, obtained by applying global magnitude-based pruning to the randomly initialized backbone architecture.

Pruned Model (PM): PM is the same as the pruned-only model in Sec. 3, obtained by applying global magnitude-based pruning to a fully converged dense model.
The final models obtained after applying Algorithm 1 (lines 3-9) to PM and RSM are denoted by PM* and RSM*, respectively. PM* is similar to the grow-and-prune-retrained model in Sec. 3.
5.1 Establishing the Effectiveness of SparseVLR:
Tab. 2 shows the results of the empirical analysis of SparseVLR for the benchmark datasets MNIST and CIFAR-10. The results are shown for three models, 4-layer CNN, 7-layer CNN, and ResNet, at sparsity amounts 90%, 95%, and 99%. The models are evaluated through the standard, PGD, and verified errors as described in Sec. 2. The column corresponding to 0% sparsity represents the error metrics computed for dense models (having the same architecture as the backbone in Algorithm 1) trained using Xu et al. [autolirpa], which align well with the state-of-the-art [crownibp, scalable_fv]. The presented results are averaged over five random initializations (seeds) of Algorithm 1.
According to Tab. 2, RSM* achieves evaluation metrics similar to (within 3% verified error of) the dense (0% sparsity) models across all datasets, models, and sparsity amounts. These results establish that (i) SparseVLR can train a verified locally robust sparse model without needing a dense overparameterized network trained till convergence, and (ii) the sparse model thus obtained is comparable to a verified locally robust dense network with the same architecture as the backbone.
Also, RSM* outperforms PM* (the best approach from Sec. 3). Tab. 2 shows that across all model-dataset scenarios, the error metrics for RSM* are comparable to PM* at smaller sparsity, and at higher sparsity (e.g., 99%) RSM* exhibits much lower error than PM*, establishing the higher efficacy of SparseVLR (Algorithm 1) compared to conventional pruning mechanisms. Comparisons of PM, PM*, and RSM* are provided in Appendix E.
To investigate the reason for the better performance of SparseVLR, a detailed analysis is provided for a 4-layer CNN trained on CIFAR-10; similar patterns were observed for the other models as well.
(i) Why the RSM* vs. PM* disparity at higher sparsity?
Fig. 2(a) depicts the loss evolution for PM→PM* and RSM→RSM* training at 95% sparsity and reveals that the losses eventually stabilize at the same level, which in turn is governed by the maximum perturbation $\epsilon$. SparseVLR uses $\epsilon$ scheduling, which applies zero perturbation for the initial epochs and then gradually increases the perturbation value. This results in an initial drop in the losses, as seen in Fig. 2(a). As the perturbation amount increases, the loss increases as well.
Notably, at 99% sparsity (Fig. 2(b)), the stabilized loss for PM* is significantly greater than that for RSM*. Higher loss leads to higher error; hence the observation. The evolution of the training loss of PM* is discussed further in Appendix E. The reason for the higher stabilized loss for PM* at high sparsity is discussed below.
(ii) Reason for the disparity in the stabilized losses of RSM* and PM*: gradient flow during training.
Evci et al. [DST_reason] claim that different initializations may lead to different gradient flows, which in turn lead to varied accuracy of the sparse models trained via DST. To explain the disparity in the stabilized losses for PM* and RSM* at 99% sparsity (Fig. 2(b)), we plot the observed gradient flows in Fig. 3(b). It depicts that the norm of the gradients observed while training RSM→RSM* is greater than the gradient norm incurred while training PM→PM*, thus allowing the model to learn new concepts (discussed further below) and hence incurring a lower stabilized loss.
(iii) How does poor gradient flow hinder the ability of the sparse network to change connections (i.e., learn new concepts) during training?
The effect of poor gradient flow is twofold: (i) fewer parameters are explored (activated) during the thickening phase, and (ii) the newly activated parameters of the auxiliary model have relatively small weight magnitudes.
The change in the structure of the sparse network obtained after a pruning step can be quantified in terms of the number of backbone parameters whose state changed from active to dormant and vice versa. Fig. 4 depicts the changes in the sparse structure during training. Notably, PM→PM* training incurs almost no change, whereas the sparse structure during RSM→RSM* training is quite dynamic. For PM→PM*, this can be attributed to the smaller magnitudes of the newly activated parameters, which become an easy target for removal during the pruning step, resulting in a nearly unchanged sparse structure. This further hinders the sparse network's ability to learn new concepts, leading to a higher loss for PM* than for RSM*.
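The structure change described above can be quantified with a simple mask-churn metric over the per-layer active/dormant masks; `mask_churn` is an illustrative helper we introduce, not the paper's code.

```python
import numpy as np

def mask_churn(prev_masks, new_masks):
    """Fraction of backbone parameters whose active/dormant state flipped
    between two consecutive pruning steps (0.0 = static structure)."""
    flips = sum(int(np.sum(p != n)) for p, n in zip(prev_masks, new_masks))
    total = sum(p.size for p in prev_masks)
    return flips / total
```

Tracking this value per epoch would reproduce the qualitative comparison in Fig. 4: near-zero churn for PM-style training versus sustained churn for RSM-style training.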
(iv) Cause of inefficient learning while training PM→PM*: is it the structure or the weight initialization?
PM inherits parameter weights from a pretrained dense model, and the PM structure is learned through global magnitude-based pruning. This section investigates whether the lower performance of PM* at high sparsity (e.g., 99%) is due to PM's retained parameter weights or its learned network structure.
For this investigation, we evaluate a Randomly Initialized Pruned Model (RIPM) as the seed network, which randomly initializes PM's parameters while keeping the learned structure intact. Tab. 3 compares RIPM*, the sparse model obtained by training RIPM→RIPM* using Algorithm 1, with PM, PM*, and RSM*. It can be observed that RIPM* generalizes comparably to RSM*. Also, Fig. 4 suggests that while training RIPM→RIPM*, the model can learn new connections (high variation in the sparse network structure) in the first half of training.
However, Fig. 3(b) shows that the gradient flow while training RIPM→RIPM* is only marginally better than that observed while training PM→PM*. Then why is RIPM* better than PM*?
A comparison of the initial seed networks' weight distributions, shown in Fig. 3(a), demonstrates that the magnitudes of the active weights in RIPM are lower than those in PM. As discussed above, poor gradient flow results in low magnitudes of the newly activated parameters' weights after the thickening phase; since the weights of RIPM's initially active parameters are also low, newly activated parameters can be selected (and replace previously active ones) during the pruning step, leading to changes in the structure of the sparse network during training. Such changes allow the model to learn new concepts, leading to the higher performance of the obtained RIPM*.
Thus, the parameter weights inherited from a fully converged dense network are responsible for the inefficient learning of PM→PM* at high sparsity amounts. This finding further motivates training a verified locally robust sparse network from scratch rather than pruning an overparameterized pretrained dense model.
Error  PM  PM*  RSM*  RIPM*
Standard  90.0  90.0  65.01  65.81
PGD  90.0  90.90  70.43  69.85
Verified  90.0  90.0  73.38  74.97
5.2 Application to Sequential Model
To further investigate the application of SparseVLR to sequential networks, we evaluate a small LSTM [autolirpa] as the backbone architecture to obtain sparse models for MNIST. Tab. 4 demonstrates the standard, PGD, and verified errors for the sparse models obtained using SparseVLR. It can be noted that up to a 95% sparsity amount, the sparse models generalize comparably to their dense counterparts. The higher error at 99% sparsity can be attributed to an insufficient number of parameters for the target task.
Sparsity  0%  80%  90%  95%  99%
Standard  4.04  4.42  4.38  5.52  11.55
PGD  9.35  11.60  10.10  13.84  17.22
Verified  14.23  11.85  10.75  13.90  18.70
5.3 Evaluations on Application-Relevant Datasets
Tab. 5 demonstrates the results for two application-relevant image datasets: (a) SVHN and (b) a pedestrian detection dataset that aims at differentiating between people and people-like objects [pedestriandataset]. The backbone architecture used for these evaluations is a 7-layer CNN, adapted to the respective datasets. For both SVHN and pedestrian detection, the error metrics of the models having 99% sparsity are comparable to those of their dense counterparts. The perturbation amounts used for SVHN and Pedestrian Detection are 8/255 and 2/255, respectively. Additionally, sparse models constructed for a sentiment analysis NLP dataset [maasetal2011learning] on an LSTM model [wordformal, autolirpa] exhibit errors within 1% of their dense counterparts, showing SparseVLR's generalizability (detailed results presented in Appendix F).
Dataset  SVHN  Pedestrian Detection
Sparsity  0%  95%  99%  0%  95%  99%
Standard  62.55  62.77  65.87  28.40  26.22  28.40
PGD  68.19  68.23  70.52  42.35  39.66  43.03
Verified  74.87  75.21  76.76  63.53  67.39  63.02
6 Design Choices for SparseVLR
Pruning after every thickening step: Liu et al. [itop] suggest that allowing the model to grow for multiple epochs before pruning results in exploring all the parameters at least once. However, the implicit sparsification caused by the robust training loss (discussed in Sec. 3) does not allow such exploration. Our empirical evaluations show that a pruning step after every thickening step results in the most optimal sparse models.
Using CROWN-IBP as the bounding mechanism: Tab. 6 compares the errors exhibited by the 4-layer CNN at 99% sparsity when trained by employing different bounding mechanisms to compute the training loss: IBP [ibp_training], CROWN [crown], Fastened-CROWN [Lyu2020FastenedCT], and CROWN-IBP [crownibp]. Empirically, CROWN-IBP results in the most optimal models for both MNIST and CIFAR. Hence, SparseVLR employs CROWN-IBP to compute the loss.
Dataset  Error  IBP  CROWN  Fastened-CROWN  CROWN-IBP
MNIST (ε = 0.4)  Standard  7.44  52.3  13.53  7.23
  PGD  10.27  68.84  18.20  9.68
  Verified  21.54  93.13  42.18  21.95
CIFAR (ε = 8/255)  Standard  62.85  69.04  68.72  65.01
  PGD  78.21  71.53  81.69  70.43
  Verified  78.52  88.3  89.39  73.38
7 Discussion on Impact and Limitations
Impact: SparseVLR will enable resource-constrained platforms to leverage verified robust models. Sparse training reduces the total training time to almost half that of conventional pruning. Moreover, the inference time of the sparse models is, on average, five times lower than that of their dense counterparts on an NVIDIA RTX A6000 (detailed results presented in Appendix H).
Limitations and future work: SparseVLR uses CROWN-IBP; thus, the results obtained are bounded by the robustness achievable with CROWN-IBP, and SparseVLR will need to be tailored to future advances in verified local robustness training mechanisms. Secondly, the empirical analysis has been done for models used in the literature [crownibp, scalable_fv], and we show that sparse models exist for such architectures; further analysis of more complex networks and domains would be beneficial but is out of the scope of this paper. Thirdly, although there is rich literature on pruning, we only investigated global unstructured pruning mechanisms, since the focus was on training sparse models from scratch and a rigorous investigation of pruning was out of scope. Our preliminary results on applying a static-mask-based sparse-training paradigm based on the Lottery Ticket Hypothesis [frankle2018the] to verified locally robust sparse network search show inferior performance in this domain, suggesting that more sophisticated approaches for finding the lottery ticket require further investigation. Finally, following most studies in the literature, this paper addresses up to 99% model sparsification. However, our analysis shows that at extreme sparsity (99.9%), generalizability decreases, indicating that the optimal sparsity lies between 99% and 99.9%; a detailed discussion is in Appendix G. Though finding the optimal sparsity is out of scope, future work in this direction would be beneficial.
8 Conclusion
This is the first work investigating and showing that a verified locally robust sparse network is trainable from scratch using a dynamic-mask-based mechanism. The ability to train a sparse network from scratch makes the conventional requirement of first training a dense network unnecessary and reduces the training time to almost half. Additionally, the empirical analysis demonstrates SparseVLR's generalizability, thus enabling the deployment of verified locally robust models in resource-constrained safety-critical systems.
References
Appendix A Experiment Details
Datasets: The empirical analysis of SparseVLR is done for 5 datasets: MNIST, CIFAR-10, SVHN, Pedestrian Detection, and Sentiment Analysis.
MNIST is a dataset of handwritten digits with 60,000 samples in the training set and 10,000 samples in the testing set. All images are greyscale and of size 28×28. The training set was generated using samples from approximately 250 writers; the writers of the test set and the training set are disjoint.
CIFAR-10 is a dataset of 60,000 images evenly distributed among 10 mutually exclusive classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The training set and the testing set consist of 50,000 and 10,000 images, respectively. These RGB images are of size 32×32 each. The test set contains 1,000 randomly selected images from each class, thus having a uniform distribution over all the classes.
SVHN is a real-world dataset for digit recognition. The images are Street View House Numbers obtained from Google Street View. The format used for the SparseVLR evaluations contains 32×32 images of individual digits with distractions. The dataset contains 73,257 images in the training set and 26,032 images for testing.
Pedestrian Detection [pedestriandataset] aims to differentiate between persons and person-like objects such as statues and scarecrows, which have features very similar to a person. The train, validation, and test sets contain 944, 160, and 235 images, respectively, with a total of 1,626 person and 1,368 non-human labels.
SparseVLR is applicable to domains other than image classification. To illustrate this, an empirical study has been done on a sentiment analysis NLP dataset [maasetal2011learning] used in previous formal verification approaches [autolirpa, wordformal]. The dataset contains 10,662 sentences evenly distributed between two classes: positive and negative.
Perturbation amount: For the image classification datasets, the perturbations are introduced in a ball of radius ε, the perturbation amount (defined in Sec. 2). The highest perturbation amounts (i.e., the most difficult scenarios) used in the literature addressing formal verification and training [ibp_training, crownibp, scalable_fv, autolirpa] are 0.4 and 8/255 for MNIST and CIFAR-10, respectively. For CIFAR-10, Zhang et al. [crownibp] use a slightly larger perturbation amount for training than for testing, and we follow the same convention: models that need to be tested at 8/255 are trained with a perturbation amount of 8.8/255. For SVHN, previous works targeting adversarial robustness [sehwag2020hydra] use a perturbation amount of 8/255; following the CIFAR-10 convention, we test at 8/255 and train at 8.8/255. For the Pedestrian Detection dataset, we test at 2/255 and follow the same convention.
For the sentiment analysis NLP dataset, the perturbations are introduced as synonym-based word substitutions in a sentence. Each word has a set of synonyms with which it can be substituted, computed using the 8 nearest neighbors in a counter-fitted word embedding space, where distance signifies similarity in meaning [wordformal, autolirpa]. The number of words being substituted is referred to as the budget. The maximum budget used in the state of the art [autolirpa, wordformal] for formal verification is 6, and we use the same amount.
Hyperparameters: The presented approach requires a set of hyperparameters as inputs: the total number of training epochs, the epoch at which the perturbation scheduler starts to increment the amount of perturbation, the length of the schedule (i.e., the number of epochs in which the perturbation scheduler has to reach the maximum amount of perturbation), and the maximum perturbation itself.
These values are set as follows: for the 4-layer CNN, 7-layer CNN, and ResNet trained on MNIST and CIFAR-10, and for the LSTM trained on MNIST, the values comply with the state of the art [autolirpa]; for SVHN, the hyperparameters are the same as those of CIFAR-10; for Pedestrian Detection, where no prior hyperparameter values are available, the chosen hyperparameters empirically resulted in the models displaying the least errors; and for the sentiment analysis NLP dataset, the values likewise comply with the state of the art [autolirpa]. No prior hyperparameter values are available for Pedestrian Detection and SVHN.
Learning rate: The training approach uses a learning rate scheduler that gradually decreases the learning rate as training proceeds after the maximum perturbation is reached. The starting learning rate is set as suggested by the state of the art [autolirpa].
Appendix B Formal Verification
An NP-Complete problem: Linear Programming [Bastani2016MeasuringNNlp] and Satisfiability Modulo Theory (SMT) [smtformal] have been used in the literature as formal verification techniques to validate neural networks. However, Pulina and Tacchella [smtchallenge] showed that Multi-Layer Perceptrons (MLPs) cannot be verified using SMT solvers, challenging the scalability of SMT solvers to neural networks of realistic sizes. Furthermore, several applications require non-linear activation functions to learn complex decision boundaries, rendering the whole NN non-linear. The large size and non-linearity of NNs make the problem of formal robustness verification non-convex and NP-Complete [reluplex, sinha2018certifiable].
Relaxation-based methods: To address the challenges mentioned above, Katz et al. [reluplex] used a relaxation of ReLU activations, temporarily violating the ReLU semantics, resulting in an efficient query solver for NNs with ReLU activations. The piecewise linearity of the ReLU activations is the basis for this relaxation. LiRPA-based methods such as CROWN [crown], IBP [ibp_training], and CROWN-IBP [crownibp] compute linear relaxations of models containing general activation functions that are not necessarily piecewise linear. CROWN [crown] uses a backward bounding mechanism to compute linear or quadratic relaxations of the activation functions. IBP [ibp_training] is a forward bounding mechanism that propagates the perturbations induced at the input towards the output layer in a forward pass based on interval arithmetic, resulting in a range of values for each class in the output layer. CROWN-IBP [crownibp] is a hybrid approach motivated by the observation that IBP can produce loose bounds; to achieve tighter bounds, it uses IBP [ibp_training] in the forward bounding pass and CROWN [crown] in the backward bounding pass to compute bounding hyperplanes for the model. CROWN-IBP has been shown to compute tighter relaxations, thus providing the best results for formal verification.
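To make the forward bounding pass concrete, here is a minimal pure-Python sketch of interval propagation through one affine layer followed by ReLU (illustrative only; real LiRPA implementations are vectorized and support many layer types):

```python
def ibp_affine(lower, upper, W, b):
    """Propagate an input box [lower, upper] through y = W x + b.
    For each output's upper bound, positive weights pick the input's
    upper bound and negative weights pick the lower bound (and the
    reverse for the output's lower bound)."""
    lo, up = [], []
    for row, bias in zip(W, b):
        lo.append(bias + sum(w * (lower[j] if w >= 0 else upper[j])
                             for j, w in enumerate(row)))
        up.append(bias + sum(w * (upper[j] if w >= 0 else lower[j])
                             for j, w in enumerate(row)))
    return lo, up

def ibp_relu(lower, upper):
    # ReLU is monotone, so the bounds map through elementwise
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# input x = [0.5, -0.5] perturbed in an l-inf ball of radius 0.1
x, eps = [0.5, -0.5], 0.1
lo, up = ibp_affine([v - eps for v in x], [v + eps for v in x],
                    W=[[1.0, -1.0], [2.0, 1.0]], b=[0.0, 0.0])
lo, up = ibp_relu(lo, up)
```

Repeating this layer by layer yields output-class intervals; as the text notes, the intervals tend to widen with depth, which is why CROWN-style backward passes are used to tighten them.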
As discussed in Sec. 2, the linear relaxation computed using LiRPA methods can be used to train verified locally robust models by minimizing the verified robust loss. This learning procedure uses a perturbation scheduler to gradually increase the amount of perturbation during training.
B.1 Perturbation Scheduler
The ε scheduler provides a perturbation amount for every training epoch, gradually increasing from 0 starting at a given epoch until the maximum perturbation is reached after a fixed number of epochs [ibp_training]. Gowal et al. [ibp_training] proposed using an ε scheduler for training based on LiRPA and deem it necessary for an effective learning procedure. Since IBP [ibp_training] uses interval arithmetic to propagate the input perturbation to the output layer, and the interval size usually keeps increasing as the propagation reaches deeper into the model, the gradual increase of ε prevents intermediate bound explosion during training. Other related approaches have also adopted the idea of gradual perturbation increment [crownibp, autolirpa]. Since the bounding mechanism used in SparseVLR is CROWN-IBP [crownibp], the use of an ε scheduler is crucial.
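A linear ramp of this kind can be sketched as follows (hypothetical parameter names; IBP training [ibp_training] actually uses a smoothed schedule in practice):

```python
def eps_schedule(epoch, eps_max, start, length):
    """Return the perturbation for a given epoch: zero before `start`,
    then a linear ramp reaching eps_max after `length` epochs, and
    constant at eps_max thereafter."""
    ramp = (epoch - start) / length
    return eps_max * min(1.0, max(0.0, ramp))
```

For an MNIST-style setting (eps_max = 0.4) with the ramp starting at epoch 10 over 50 epochs, the schedule returns 0.0 at epoch 5, roughly 0.2 at epoch 35, and 0.4 for any epoch past 60.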
Appendix C Model Compression
Several model compression approaches have been proposed to reduce the computational and storage requirements of deep neural networks (DNNs). Quantization converts parameter values to low-precision approximations requiring fewer storage bits per parameter [10.5555/3122009.3242044, Gong2014CompressingDC, quantization, chmiel2021neural]. Model Distillation [KD_main] and Neural Architecture Search [NAS] train a smaller dense architecture that generalizes comparably to the dense network. Low-Rank Factorization [lowrankfactorization] computes a matrix decomposition of the weight matrices that requires fewer floating point operations. Model Sparsification/Pruning is the most common form of model compression, constraining the model to use only a subset of the original parameters for inference [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra].
C.1 Model Sparsification
Hoefler et al. [pruningsurvey] categorize model sparsification according to the stage at which sparsification is performed: train-and-sparsify (after training), sparsifying-during-training (while training), and sparse-training (before training).
Train-and-sparsify approaches [han2015deep, zhang2018systematic, sehwag2019compact, sehwag2020hydra] train a dense model and remove (prune) the parameters contributing the least towards model inference. Since the removal of parameters may lead to loss of information learned by the original dense model, it is often followed by retraining the retained parameters, known as the fine-tuning step. Train-and-sparsify associates a static binary mask with the parameters, which allows the fine-tuning step to update only the retained parameters.
The sparsifying-during-training mechanism gradually removes NN model parameters in small fractions per round. Each round is followed by a fine-tuning step allowing the model to recover from the parameter removal [itertaivepruning, gatedecorator]. This is often called iterative pruning.
The sparse-training mechanism trains a sparse network from scratch using either a static or a dynamic mask. Identifying a static-mask sparse network is motivated by the Lottery Ticket Hypothesis (LTH) [frankle2018the], which states that "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations". However, finding the static mask in LTH [frankle2020linear, frankle2018the, Stable_lottery, savarese2020winning, pmlrv119malach20a] still requires a fully or partially trained dense model to start with.
Recently, dynamic-mask-based sparse-training mechanisms known as Dynamic Sparse Training (DST) have been developed. DST approaches [nest, itop] start with a randomly initialized sparse network and explore new network connections while learning the network weights for the target task during training. SparseVLR adapts a DST approach to identify a verified locally robust sparse network from scratch.
C.2 Selecting the parameters to be removed
While all the sparsification approaches mentioned above remove network parameters (i.e., weights), the parameters to remove are selected based on different criteria, such as the magnitude of the weights, the magnitude of the corresponding gradients, or minimal impact on the Hessian matrix [pruningsurvey]. Magnitude-based pruning is the most popular mechanism [han2015learning, sehwag2019compact, zhang2018systematic, han2015deep, Li2017PruningFF, Sanh0R20, sehwag2020hydra, eie]: it removes the least-magnitude parameters, on the premise that a parameter's importance is proportional to the magnitude of its weight.
Moreover, removal of parameters can be done either per layer or globally, and recent studies [globalpruningsee, globalpruning1, globalpruning2, globalpruning3, globalpruning] suggest the higher efficacy of global unstructured pruning. Sanh et al. [movementpruning] demonstrated that global magnitude-based pruning tends to prune the earlier layers of a network by a lesser amount. Since the earlier layers of most networks are feature extraction layers, which hold more importance for a classification task, it can be concluded that global pruning captures the importance of a layer as a whole.
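This behavior is easy to see in a toy sketch of global unstructured magnitude pruning (illustrative, not the paper's implementation): a single magnitude threshold is computed across all layers, so a layer whose weights are uniformly larger, as early feature extractors often are, loses fewer parameters.

```python
def global_prune_masks(layers, sparsity):
    """One global magnitude threshold across all layers; returns a
    boolean keep-mask per layer."""
    magnitudes = sorted((abs(w) for layer in layers for w in layer),
                        reverse=True)
    n_keep = max(1, int(len(magnitudes) * (1 - sparsity)))
    threshold = magnitudes[n_keep - 1]
    return [[abs(w) >= threshold for w in layer] for layer in layers]
```

At 50% global sparsity, a large-magnitude "early" layer can survive intact while a small-magnitude "late" layer is pruned away entirely, which is exactly the layer-level importance effect described above.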
Appendix D CROWN-IBP causes regularization
The motivation for SparseVLR is that a huge fraction of the parameters of locally robust models trained using CROWN-IBP have low magnitudes, so the removal of such less significant parameters has minimal effect on model generalizability (Sec. 3).
Fig. 5 shows the weight distribution of models trained to minimize the natural and the verified locally robust (CROWN-IBP [crownibp]) losses. The trends are shown for the 4-layer and 7-layer CNNs trained on the MNIST and CIFAR-10 datasets. As evident from the distribution of weights, a higher fraction of weights in the verified locally robust models have very low magnitudes. This observation suggests that training to minimize the verified robust loss (Eq. 4) penalizes the parameter magnitudes.
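This observation is straightforward to quantify; a one-line sketch (the 1e-3 cutoff is an arbitrary illustrative choice, not the paper's exact threshold):

```python
def low_magnitude_fraction(weights, cutoff=1e-3):
    """Fraction of weights whose magnitude falls below `cutoff`; a high
    value indicates the training objective implicitly sparsifies."""
    return sum(1 for w in weights if abs(w) < cutoff) / len(weights)
```

Comparing this statistic between a naturally trained model and a CROWN-IBP-trained model reproduces the trend shown in Fig. 5.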
Appendix E Explaining the generalizability difference for various sparse training mechanisms
E.1 Conventional Fine-tuning vs Grow-and-Prune
Conventional fine-tuning aiming to minimize the verified robust loss (Sec. 3) increases the standard and verified errors even at small compression amounts. This increase can be attributed to the poor gradient flow encountered during static-mask-based fine-tuning, in agreement with [DST_reason]. However, retraining a pruned model using grow-and-prune (Sec. 3) does not suffer from poor gradient flow and hence helps maintain the generalizability of the sparse compressed models.
To investigate further, Fig. 6 compares the losses and the norms of the gradients encountered while retraining a globally pruned network using conventional static-mask-based fine-tuning (keeping the mask fixed throughout training) and using the grow-and-prune mechanism [nest]. The results are shown for a 4-layer CNN trained on CIFAR-10 at 90% sparsity. It is important to note that the training procedure uses ε scheduling, which provides zero perturbation for the initial epochs and then gradually increases the perturbation value. This results in an initial drop in the losses, as seen in Fig. 6(b); as the perturbation amount increases, the loss increases as well and finally stabilizes.
Notably, the losses encountered by fine-tuning are higher than those of retraining using grow-and-prune, resulting in higher errors in the obtained compressed models (shown in Fig. 6(b)). The difference in losses between the two approaches can be attributed to the difference in gradient flows (see Fig. 6(a)). Retraining using grow-and-prune incurs higher gradient flows than conventional fine-tuning, and according to Evci et al. [DST_reason], higher gradient flows lead to more efficient training.
Dataset                           MNIST                                    CIFAR-10
Sparsity                   0%     80%    90%    95%    99%       0%     80%    90%    95%    99%

4-layer CNN
PM             Standard    -      5.11   5.98   12.98  59.83     -      60.0   61.28  84.33  90.0
               PGD         -      6.75   7.84   16.03  62.19     -      67.3   68.05  86.43  90.0
               Verified    -      18.49  22.80  40.7   79.18     -      71.41  72.02  87.4   90.0
PM (retrained) Standard    5.29   4.78   4.79   4.78   6.97      59.88  58.81  59.32  60.73  90.0
               PGD         10.93  6.34   6.30   6.23   8.98      67.25  68.46  66.01  66.74  90.90
               Verified    18.07  17.71  17.22  17.46  21.35     71.5   70.86  71.04  72.12  90.0
RSM            Standard    -      5.67   5.71   6.59   8.09      -      60.83  60.78  61.26  64.14
               PGD         -      7.56   7.62   9.00   10.63     -      68.45  67.08  67.13  69.09
               Verified    -      19.22  19.33  20.84  23.14     -      72.15  71.90  72.45  73.43

7-layer CNN
PM             Standard    -      2.31   2.32   2.32   8.06      -      56.69  70.78  84.05  90.0
               PGD         -      5.71   9.72   99.93  81.91     -      68.75  82.63  84.88  90.0
               Verified    -      99.01  84.77  100.0  100.0     -      68.83  79.54  86.19  90.0
PM (retrained) Standard    2.27   1.85   2.05   3.37   2.11      56.08  54.59  55.33  55.24  90.0
               PGD         6.19   3.52   3.93   4.85   3.61      68.65  68.46  67.59  67.21  90.0
               Verified    12.2   12.25  11.96  14.51  12.06     68.66  68.5   68.49  69.6   90.0
RSM            Standard    -      2.37   72.32  3.01   2.76      -      55.11  55.41  56.96  59.14
               PGD         -      4.11   3.88   4.21   4.75      -      67.17  68.26  68.30  67.04
               Verified    -      12.54  12.39  12.6   12.83     -      68.60  68.84  69.12  71.21

ResNet
PM             Standard    -      3.68   6.51   4.09   62.06     -      55.68  88.06  90.0   90.0
               PGD         -      6.03   8.98   5.74   64.36     -      65.24  89.41  90.0   90.0
               Verified    -      16.62  20.97  19.34  82.02     -      70.41  89.44  90.0   90.0
PM (retrained) Standard    3.65   3.24   73.23  3.17   3.12      53.6   53.38  55.17  90.0   90.0
               PGD         9.11   4.22   4.16   4.04   3.98      63.83  64.37  66.25  90.0   90.0
               Verified    14.30  13.45  13.52  13.34  13.80     68.58  68.57  69.28  90.0   90.0
RSM            Standard    -      3.54   3.91   3.88   4.95      -      55.36  56.30  57.99  60.31
               PGD         -      4.53   5.05   5.16   6.77      -      66.55  67.44  67.04  68.88
               Verified    -      15.5   15.06  15.39  17.74     -      68.95  69.33  70.14  71.38
E.2 Training PM vs RSM
As discussed in Sec. 5, the effectiveness of SparseVLR is established by comparing the sparse models obtained by applying Algorithm 1 (steps 3-9) to different seed networks: PM (Pruned Model) and RSM (Random Sparse Model). Tab. 7 compares the sparse models obtained using three mechanisms: PM (global pruning), retrained PM (Algorithm 1, steps 3-9, applied to PM), and RSM (obtained using SparseVLR). The results are shown for three backbone architectures (4-layer CNN, 7-layer CNN, and ResNet) at four sparsity amounts (80%, 90%, 95%, and 99%), trained for two datasets, MNIST and CIFAR-10. RSM outperforms both PM and retrained PM, especially at high sparsity amounts such as 99%.
To investigate the difference in losses encountered while training retrained PM and RSM at 99% sparsity, Fig. 7 (same as Fig. 2) shows the loss evolution for both scenarios. Retrained PM incurs a higher loss than RSM, thus resulting in higher errors.
Interestingly, at 99% sparsity, the loss evolution for retrained PM does not follow the expected pattern of an initial decrease followed by an increase (see Fig. 7). This can be attributed to poor gradient flow while training retrained PM at 99% sparsity (as shown in Fig. 3(b) in Sec. 5). The low magnitude of the gradients results in low magnitudes in the newly activated parameters, which eventually get removed during the pruning step. On the other hand, the already active parameters, which have significantly higher magnitudes, tend to become more polarized, resulting in overfitting. However, with an initially high loss, the gradient flow increases enough to generate some newly activated parameters with sufficient magnitude to replace already activated parameters, which results in some decrease and eventual stabilization of the loss.
Appendix F Application to NLP for sentiment analysis
As discussed in Appendix A, for the sentiment analysis dataset, the perturbations are introduced as synonym-based word substitutions in a sentence, and the number of words substituted is called the budget. The budget used in the empirical analysis is 6.
Notably, this training does not use norm-bounded perturbations; to keep the perturbation amount continuous, the training procedure consists of an initial warm-up phase [autolirpa] in which the effective embedding used during training combines the embeddings of the clean word and the perturbed word at each index, with the combination coefficient gradually increased during training. The results produced for the original dense model (sparsity = 0%) comply with the state of the art [autolirpa]. Table 8 shows that the sparse models obtained at different compression amounts using the presented approach exhibit standard and verified errors comparable to the original dense model. The PGD attack is based on norm-bounded perturbations, which are not applicable to this domain, so Table 8 does not include PGD errors.
Sparsity    0%     80%    90%    95%    99%
Error
Standard    20.92  20.26  22.24  21.11  21.53
Verified    24.27  22.35  23.94  23.53  23.72
Appendix G Extreme sparsity
The results presented in Tab. 2 show that the 99% sparse models generalize comparably to the dense models, indicating the possibility of exploring further sparsity. However, at 99.9% sparsity, the 7-layer CNN exhibits an increment of approximately 7% in standard error and 10% in verified error for both MNIST and CIFAR-10. This suggests that an optimal sparsity exists in the range of 99%-99.9% at which the sparse model generalizes comparably to its dense counterpart.
Appendix H Computation time benefit
In addition to the reduction of non-zero (active) parameters, we observe that the sparse models obtained using our approach incur much lower computation time. We measure the computation time as the average inference time per sample on an NVIDIA RTX A6000. The computation times shown in Table 9 are for the CIFAR-10 dataset for three models (4-layer CNN, 7-layer CNN, and ResNet) at different sparsity amounts. The inference time required by the sparse models obtained using SparseVLR is, on average, five times lower than that of the dense models.
Sparsity       0%   80%   90%   95%   99%
Model
4-layer CNN
7-layer CNN
ResNet
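The inference-time benefit comes from skipping multiplications against zeroed weights. A minimal compressed-sparse-row sketch illustrates the idea (the reported timings use standard GPU kernels, not this code):

```python
def to_csr(dense):
    """Compress a dense weight matrix: store only nonzero entries."""
    data, cols, indptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                data.append(w)
                cols.append(j)
        indptr.append(len(data))
    return data, cols, indptr

def csr_matvec(data, cols, indptr, x):
    """y = W x touching only the stored (nonzero) weights."""
    return [sum(data[k] * x[cols[k]]
                for k in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]
```

At 99% sparsity the inner loop touches only 1% of the original weights, which is where the speedup originates; the realized gain depends on the hardware and kernel support for sparse operations.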