Verification of Neural Networks: Enhancing Scalability through Pruning

Verification of deep neural networks has witnessed a recent surge of interest, fueled by success stories in diverse domains and by mounting concerns about safety and security in envisaged applications. The complexity and sheer size of such networks are challenging for automated formal verification techniques which, on the other hand, could ease the adoption of deep networks in safety- and security-critical contexts. In this paper we focus on enabling state-of-the-art verification tools to deal with neural networks of some practical interest. We propose a new training pipeline based on network pruning with the goal of striking a balance between maintaining accuracy and robustness while making the resulting networks amenable to formal analysis. The results of our experiments with a portfolio of pruning algorithms and verification tools show that our approach is successful for the kind of networks we consider and for some combinations of pruning and verification techniques, thus bringing deep neural networks closer to the reach of formally-grounded methods.


1 Introduction

Verification of neural networks is currently a hot topic spanning different areas of AI, across machine learning, constraint programming, heuristic search and automated reasoning. A recent extensive survey [huang2018safety] cites more than 200 papers, most of which have been published in the last few years, and more contributions are coming out at an impressive rate — see, e.g., [DBLP:conf/hybrid/DuttaCJST19, DBLP:conf/cav/KatzHIJLLSTWZDK19, DBLP:journals/corr/abs-1807-03571, DBLP:conf/aaai/NarodytskaKRSW18, DBLP:journals/corr/LomuscioM17, DBLP:conf/nips/WangPWYJ18] to cite only some. Most of the current literature focuses on the verification of Deep Neural Networks (DNNs): while their application in various domains [DBLP:journals/nature/LeCunBH15] has made them one of the most popular machine-learned models to date, concerns about their vulnerability to adversarial perturbations [DBLP:journals/corr/SzegedyZSBEGF13, DBLP:journals/corr/GoodfellowSS14] have accompanied them since their initial adoption, to the point of restraining their application in safety- and security-related contexts.

In this paper we present evidence that state-of-the-art verification tools can deal with DNNs of some practical interest through preprocessing based on statistical techniques. We propose a new training pipeline that leverages network pruning — a controlled reduction of the size of neural networks to eliminate portions that are deemed not crucial for their performance [DBLP:conf/icnn/SietsmaD88] — to produce easier-to-verify networks. Our goal is to strike a balance between maintaining the accuracy, generalization power and robustness of our networks, while making them amenable to formal analysis. In particular, we consider two mainstream pruning techniques, namely neuron pruning [DBLP:conf/icnn/SietsmaD88] and weight pruning [DBLP:conf/nips/CunDS89]. Intuitively, the former attempts to make DNNs leaner by removing neurons, and thus severing all connections across different layers going through pruned neurons; the latter acts on connections, removing neurons only when all connections through a neuron are zeroed out. Both techniques make the network graph sparser, so the key idea behind our approach is that, as long as sparse networks retain acceptable performance, their verification might be easier than that of their dense counterparts [DBLP:conf/iclr/XiaoTSM19], enabling deployment in contexts where formal guarantees are of the essence, and smaller networks are a welcome bonus.

To put our idea to the test, we consider fully-connected DNNs trained on well-known datasets — namely MNIST [lecun1998gradient] and Fashion MNIST [DBLP:journals/corr/abs-1708-07747] — and a portfolio of verification tools to assess the gain brought by pruning to the verification phase. Among the plethora of verification methods available [huang2018safety], we focus on the kind surveyed in [DBLP:journals/corr/abs-1805-09938], characterized by a “push button” approach in which, given input pre-conditions and output post-conditions stated as constraints in some formal logic, a tool is asked to answer an entailment query in such logic, i.e., whether the post-conditions are satisfied given that the pre-conditions hold. We further focus on tools that are distributed as system prototypes, and assemble our portfolio considering diverse approaches within this class. In particular, we consider Marabou [DBLP:conf/cav/KatzHIJLLSTWZDK19], ERAN [DBLP:conf/iclr/SinghGPV19] and MIPVerify [DBLP:conf/iclr/TjengXT19]. Marabou transforms the entailment query into constraint satisfaction problems solved with a dedicated satisfiability modulo theories (SMT) solver. ERAN tackles the certification of neural networks against adversarial perturbations by combining over-approximation methods with precise mixed integer linear programming (MILP). Finally, MIPVerify is a robustness analyzer based entirely on a compilation to MILP.

Through the combination of weight and neuron pruning and our selection of verification tools we can observe the following results:

  • Without network pruning, the verification of relatively small fully connected networks — about 50K connections on 896 neurons — is mostly out of the reach of complete verification tools within a timeout of 10 CPU minutes.

  • Neuron pruning is a stronger enabler than weight pruning; all tools benefit from it, and for some combinations of parameters and verification tools, neuron pruning enables the verification of all the network instances that we consider within the timeout.

  • Weight pruning appears to be effective only for small networks and high magnitudes of pruning, i.e., when a large fraction of the total weights is removed. In particular, ERAN, both in its complete and incomplete versions, seems able to reap more benefits from weight pruning than MIPVerify and Marabou.

While our findings concern fully-connected networks trained and tested on specific datasets, we believe that similar advantages can be obtained with more complex feed-forward architectures including, e.g., convolutional or residual layers, and with other standard datasets; this will be the subject of our continuing research.

The rest of the paper is structured as follows. In Section 2 we introduce some basic notation and definitions to be used throughout the paper. In Section 3 we introduce our approach, detailing the algorithms for different kinds of pruning and other modifications required to make networks amenable to formal analysis. In Section 4 we describe the datasets on which our experimental analysis is based, and the specific conditions that we asked the verification tools to check. In Section 5 we present the setup and the results of our experimental analysis. We conclude the paper in Section 6 with some final remarks and our agenda for future research.

2 Background

A neural network is a system of interconnected computing units called neurons. In fully connected feed-forward networks, neurons are arranged in disjoint layers, with each layer fully connected only to the next one, and with no connections between neurons in the same layer. Given a feed-forward neural network $\nu$ with $N$ layers, we denote the $i$-th layer of $\nu$ as $l_i$. We call a layer without incoming connections an input layer, a layer without outgoing connections an output layer, while all other layers are referred to as hidden layers. Each hidden layer performs specific transformations on the inputs it receives. In this work we consider hidden layers that make use of linear and batch normalization modules. Given an input vector $x$, a linear module computes a linear combination of its values as follows:

$z_i = W_i x + b_i$ (1)

where $W_i$ is the matrix of weights and $b_i$ is the vector of the biases associated with the linear module in the $i$-th layer, and $z_i$ is the corresponding output. Entries of both $W_i$ and $b_i$ are learned parameters. In our target architectures, each linear module is followed by a batch normalization module. This is done to address the so-called internal covariate shift problem, i.e., the change of the distribution of each layer’s input during training [DBLP:conf/icml/IoffeS15]. The mathematical formulation of batch normalization layers can be expressed as

$y_i = \gamma \odot \dfrac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (2)

All the operators in this equation are element-wise: in particular, $\odot$ and the fraction symbol represent the Hadamard product and division, respectively. $y_i$ and $z_i$ are the output and the input vectors of the module, respectively. $\gamma$, $\beta$, $\mu$ and $\sigma^2$ are vectors, whereas $\epsilon$ is a scalar value. These are learned parameters of the batch normalization layer. In particular, $\mu$ and $\sigma^2$ are the estimated mean and variance of the inputs computed during training.


Finally, the output of hidden layer $l_i$ is computed as $h_i = f(y_i)$, where $f$ is the activation function associated to the neurons in the layer. We consider only networks using Rectified Linear Unit (ReLU) activation functions, i.e., $f(x) = \mathrm{ReLU}(x) = \max(0, x)$, applied element-wise. Given an input vector $x$, the network computes an output vector $\hat{y}$ by means of the following computations

$h_0 = x, \qquad h_i = f\big(\mathrm{BN}_i(W_i h_{i-1} + b_i)\big) \text{ for } i = 1, \dots, N, \qquad \hat{y} = h_N$ (3)

where $\mathrm{BN}_i$ denotes the batch normalization module of the $i$-th layer, as in (2).

A neural network can thus be considered as a non-linear function $\nu: I \times \Theta \to O$, where $I$ is the input space of the network, $O$ is the output space and $\theta \in \Theta$ is the vector representing the weights of all the connections. We consider neural networks applied to the classification of $n$-dimensional vectors of real numbers, i.e., $I = \mathbb{R}^n$ and $O = \mathbb{R}^m$, where $n$ is the dimension of the input vector and $m$ is the dimension of the output vector, and thus also the number of possible classes of interest. We assume that, given an input sample $x$, the output vector $\hat{y} = \nu(x, \theta)$ contains the likelihood that $x$ belongs to each of the $m$ classes. The specific class can be computed as $c = \arg\max_{i \in \{1, \dots, m\}} \hat{y}_i$, where $\hat{y}_i$ denotes the $i$-th element of $\hat{y}$. Training of (deep) neural networks poses substantial computational challenges, since for state-of-the-art models the size of $\theta$ can be in the order of millions. As in any machine learning task, training must select weights so as to maximize the likelihood that the network responds correctly, i.e., if the input $x$ is of class $c$, the chance of misclassification should be as small as possible, where misclassification occurs whenever $\arg\max_{i} \hat{y}_i \neq c$.
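For concreteness, the following is a minimal PyTorch sketch of the kind of architecture just described — linear modules, each followed by batch normalization and a ReLU, with the predicted class obtained via argmax. The layer widths below are illustrative placeholders of ours, not the networks used in our experiments.

import torch
import torch.nn as nn

# Minimal sketch of the architecture of Section 2: linear modules (Eq. 1),
# each followed by batch normalization (Eq. 2) and a ReLU activation.
class FCNet(nn.Module):
    def __init__(self, sizes=(784, 64, 32, 16, 10)):  # illustrative widths
        super().__init__()
        layers = []
        for i in range(len(sizes) - 2):
            layers += [nn.Linear(sizes[i], sizes[i + 1]),  # z = Wx + b
                       nn.BatchNorm1d(sizes[i + 1]),       # Eq. (2)
                       nn.ReLU()]                          # h = max(0, y)
        layers.append(nn.Linear(sizes[-2], sizes[-1]))     # output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # Eq. (3): composition of all layers

net = FCNet().eval()        # eval mode: BN uses its estimated mu and sigma^2
x = torch.rand(1, 784)      # a flattened grayscale image
y_hat = net(x)              # output vector of class scores
c = y_hat.argmax(dim=1)     # c = argmax_i of y_hat_i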

Training can be achieved through the minimization of some kind of loss function $L$, whose value is low when the chance of misclassification is also low. While there are many different kinds of loss functions, in general they are structured in the following way:

$L(\theta) = \frac{1}{p} \sum_{j=1}^{p} \ell\big(\nu(x_j, \theta), c_j\big) + \lambda R(\theta)$ (4)

where $p$ is the number of training pairs $(x_j, c_j)$, $c_j$ is the correct class label of $x_j$, $\ell$ represents the loss caused by misclassification, $R$ is a regularization function, and $\lambda$ is the parameter controlling the effect of $R$ on $L$. The regularization function is needed to avoid overfitting, i.e., high variance of the training results with respect to the training data. The regularization function usually penalizes models with high complexity by smoothing out sharp variations induced in the trained network by the loss term $\ell$. A common regularization function is, for example, the squared L2 norm:

$R(\theta) = \|\theta\|_2^2 = \sum_{k} \theta_k^2$ (5)
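As a sketch, the regularized objective of Eq. (4)–(5) can be written in PyTorch as follows; the function name and the choice of cross-entropy as the misclassification term $\ell$ are ours, since the discussion above does not fix a specific loss.

import torch
import torch.nn.functional as F

# Hedged sketch of Eq. (4)-(5): data loss plus lambda times the squared L2
# norm of all parameters theta; cross_entropy stands in for the generic
# misclassification term, lam controls the regularization strength.
def regularized_loss(model, x_batch, c_batch, lam=1e-4):
    y_hat = model(x_batch)
    data_loss = F.cross_entropy(y_hat, c_batch)          # misclassification term
    r = sum((p ** 2).sum() for p in model.parameters())  # Eq. (5): ||theta||_2^2
    return data_loss + lam * r                           # Eq. (4)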

3 Pruning for verification: training pipeline

In this section we present the core algorithms of our approach, based on neuron pruning (NP) and weight pruning (WP). The aim of NP is to remove neurons from the network, together with the corresponding input and output connections. Algorithm 1 reports our approach based on NP techniques, adapted from those proposed in [DBLP:conf/iccv/LiuLSHYZ17] for convolutional neural networks. Looking at the pseudocode in Algorithm 1, the procedure takes as input (Line 1) the network to train ($\nu$), the sparsity rate ($sr$) — which indicates the percentage of neurons to prune — and a regularization coefficient ($\lambda_s$); the pruned network $\nu_p$ is returned as output (Line 8). The first step of the algorithm consists in training the network (Line 2) using a loss function with a regularization term which encourages nullifying the $\gamma$ parameters of the batch normalization layers. Referring to the network topology presented in Section 2, the regularization term we consider in sparseTraining can be formalized as $R_s(\theta) = \sum_{\gamma \in \Gamma} |\gamma|$, where $\Gamma$ is the set of all $\gamma$ parameters of the batch normalization layers: each of these parameters identifies a neuron of the previous linear layer. In Lines 3 and 4, we extract the weight parameters (i.e., the $\gamma$s) of the batch normalization layers and sort them in ascending order. In Line 5, we select as threshold the weight whose index equals the number of neurons we want to prune (given by the sparsity rate multiplied by the number of weights): in this way, the number of weights below the threshold corresponds exactly to the number of neurons we want to prune. pruneNeurons (Line 6) applies neuron pruning by eliminating all the neurons corresponding to the weights of the batch normalization layers which are below the threshold computed in Line 5. Finally, the pruned network is trained again in order to remedy the loss of accuracy caused by pruning (fineTune, Line 7).

1:function neuronPruning($\nu$, $sr$, $\lambda_s$)
2:     $\nu_s \leftarrow$ sparseTraining($\nu$, $\lambda_s$)
3:     $w \leftarrow$ extractWeightsBatchnorm($\nu_s$)
4:     $w \leftarrow$ incOrder($w$)
5:     $t \leftarrow w[sr \cdot$ len($w$)$]$
6:     $\nu_p \leftarrow$ pruneNeurons($\nu_s$, $t$)
7:     $\nu_p \leftarrow$ fineTune($\nu_p$)
8:     return $\nu_p$
9:end function
Algorithm 1 Neuron Pruning
1:function weightPruning($\nu$, $sr$)
2:     $\nu_t \leftarrow$ Training($\nu$)
3:     $w \leftarrow$ extractWeightsLinear($\nu_t$)
4:     $w \leftarrow$ incOrder($w$)
5:     $t \leftarrow w[sr \cdot$ len($w$)$]$
6:     $\nu_p \leftarrow$ pruneWeights($\nu_t$, $t$)
7:     $\nu_p \leftarrow$ fineTune($\nu_p$)
8:     return $\nu_p$
9:end function
Algorithm 2 Weight Pruning

Considering WP, the general aim of this technique is to eliminate weights, i.e., set them to 0. Also in this case, weights are pruned if their values after training are below a user-specified threshold. Algorithm 2 shows the pseudocode of our approach, inspired by [DBLP:journals/corr/HanPTD15]. In Line 2 we train the network using a standard training procedure. In Lines 3 and 4, the weights of the linear layers are extracted from the original network and sorted in ascending order. In Line 5 we select the threshold from the ordered weights based on the sparsity rate parameter $sr$, in the same way as we do in the NP algorithm. Then, pruneWeights (Line 6) sets to 0 all the weights of the network of interest which are smaller than the threshold, hence obtaining the pruned network $\nu_p$. Analogously to what we do for NP, at the end of the procedure (Line 7) we fine-tune the pruned network.
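The threshold-selection step shared by Algorithms 1 and 2 can be sketched in PyTorch as follows; the function names are ours, and sorting by absolute value is an assumption, since the pseudocode leaves the sign convention implicit.

import torch
import torch.nn as nn

# Sketch of Lines 3-6 of Algorithms 1 and 2: sort the relevant weights in
# ascending order of magnitude and take the entry at index sr * len(w) as
# threshold, so that exactly a fraction sr of the weights falls below it.
def magnitude_threshold(weights: torch.Tensor, sr: float) -> float:
    w = weights.abs().flatten().sort().values  # incOrder(|w|)
    return w[int(sr * len(w))].item()          # t = w[sr * len(w)]

def prune_weights(linear: nn.Linear, sr: float) -> None:
    # Weight pruning (Algorithm 2): zero out connections below the threshold.
    t = magnitude_threshold(linear.weight.data, sr)
    linear.weight.data *= (linear.weight.data.abs() >= t)

# For neuron pruning (Algorithm 1), the same threshold would instead be
# computed over the gamma parameters of the batch normalization layers
# (bn.weight in PyTorch), and whole neurons below it would be removed.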

Encoding batch normalization layers as linear layers

Not all verification tools accept networks with batch normalization layers. For this reason, our final models need to be devoid of such modules, but training and neuron pruning require those layers to function properly. To overcome this hurdle, we propose a post-processing technique that merges a batch normalization layer and the preceding linear layer into a new linear layer. Merging is performed after training and pruning. To see how merging is performed, let us consider again the expressions for a generic linear layer and its subsequent batch normalization layer:

$z = Wx + b, \qquad y = \gamma \odot \dfrac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (6)

By substituting the value of $z$ in the second equation with the value given by the first one, we obtain:

$y = D\,(Wx + b - \mu) + \beta, \qquad D = \dfrac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \odot I_n$ (7)

where $I_n$ represents the identity matrix of dimension $n$, so that $D$ is the diagonal matrix carrying the element-wise scaling factors of the batch normalization module, and all the other symbols have the same meaning as the corresponding ones in equations (1) and (2). Therefore we can express the operations performed by the two layers by means of a new linear layer defined by the following parameters

$W' = D\,W, \qquad b' = D\,(b - \mu) + \beta$ (8)
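A minimal sketch of this merging step in PyTorch, assuming a 1-d batch normalization module in inference mode (the function name is ours):

import torch
import torch.nn as nn

# Sketch of Eq. (6)-(8): fold a BatchNorm1d module into the preceding
# Linear layer. With d = gamma / sqrt(sigma^2 + eps), the merged layer
# computes W'x + b' with W' = diag(d) W and b' = d * (b - mu) + beta.
def fold_batchnorm(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    merged = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        d = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        merged.weight.copy_(d.unsqueeze(1) * linear.weight)               # W' = diag(d) W
        merged.bias.copy_(d * (linear.bias - bn.running_mean) + bn.bias)  # b'
    return merged

In inference mode the merged layer computes exactly the same function as the original linear and batch normalization pair, so verification results transfer to the trained network.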

4 Case studies

In our experimental analysis we consider networks trained on the MNIST [lecun1998gradient] and Fashion MNIST (FMNIST) [DBLP:journals/corr/abs-1708-07747] datasets. Both datasets consist of 70000 grayscale images of $28 \times 28$ pixels divided into 10 different classes, each class represented by the same number of images. Both datasets are divided into a training set of 60000 images and a test set of 10000 images. The difference between the two datasets is that MNIST images represent handwritten digits, whereas FMNIST images represent fashion articles. While the two datasets are similar, it turns out that training neural classifiers for FMNIST is harder than solving the same task for MNIST, as reported in [DBLP:journals/corr/abs-1708-07747].

To perform verification of the trained networks, we consider the absence of $\epsilon$-bounded adversarial examples as our main requirement. We recall that the infinity norm, denoted as $\|\cdot\|_\infty$, is defined for any vector $x \in \mathbb{R}^n$ as $\|x\|_\infty = \max_{i} |x_i|$.
Besides simplicity and generality — the existence of $\epsilon$-bounded adversarial examples can be easily stated in all the verification tools we consider — the absence or timely detection of such adversarial examples is relevant for practical applications [DBLP:journals/corr/SzegedyZSBEGF13]. We can formalize the absence of $\epsilon$-bounded adversarial examples for networks performing image classification tasks in the following way. Let $\nu$ be a neural network for image classification as defined in Section 2. Given an input image $x$ and a bound $\epsilon$ on the infinity norm, we can express a targeted $\epsilon$-bounded adversarial example of class $c_t$ as an image $\tilde{x}$ such that

$\|\tilde{x} - x\|_\infty \le \epsilon \;\Rightarrow\; \arg\max_{i} \nu(\tilde{x}, \theta)_i = c_t$ (9)

where “$\Rightarrow$” denotes logical implication and $c_t$ is the class assigned by the network to the adversarial $\tilde{x}$. The corresponding untargeted adversarial example is an image $\tilde{x}$ such that

$\|\tilde{x} - x\|_\infty \le \epsilon \;\Rightarrow\; \arg\max_{i} \nu(\tilde{x}, \theta)_i \neq c$ (10)

where $c$ is the correct label for the image $x$. In the following we speak of targeted and untargeted adversarial examples meaning the $\epsilon$-bounded versions defined above, and we describe in detail how we encoded the search for adversarial examples in the tools that we consider.
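Conditions (9) and (10) translate directly into a concrete check; the sketch below (names ours) tests whether a candidate $\tilde{x}$ is a targeted or untargeted adversarial example for a PyTorch classifier.

import torch

# Check conditions (9)-(10) for a candidate x_tilde: it must stay within the
# infinity-norm ball of radius eps around x, and the predicted class must
# reach the target (targeted) or differ from the correct one (untargeted).
def is_adversarial(model, x, x_tilde, eps, correct_class, target_class=None):
    if (x_tilde - x).abs().max().item() > eps:  # ||x_tilde - x||_inf <= eps
        return False
    pred = model(x_tilde).argmax(dim=1).item()
    if target_class is not None:
        return pred == target_class             # targeted, condition (9)
    return pred != correct_class                # untargeted, condition (10)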

The encoding of targeted adversarial examples for Marabou is straightforward. Considering a network with $I = \mathbb{R}^n$ and $O = \mathbb{R}^m$, given an input image $x$, the target adversarial class $c_t$ and the bound $\epsilon$ on the infinity norm, we can encode (9) as:

$\bigwedge_{i=1}^{n} \big(x_i - \epsilon \le \tilde{x}_i \le x_i + \epsilon\big) \;\wedge\; \bigwedge_{j \ne c_t} \big(\tilde{y}_{c_t} \ge \tilde{y}_j\big)$ (11)

where $x_i$ denotes the $i$-th component of vector $x$, and $\tilde{x}_i$ with $i \in \{1, \dots, n\}$ and $\tilde{y}_j$ with $j \in \{1, \dots, m\}$ are search variables, tagged as input and output variables, respectively: Marabou connects input and output variables through the encoding of $\nu$, performed automatically — we refer to [DBLP:conf/cav/KatzHIJLLSTWZDK19] for more details. The encoding of the untargeted adversarial search is currently beyond our capability because Marabou does not offer an easy way to encode disjunctions of constraints like $\bigvee_{j \ne c} \tilde{y}_j \ge \tilde{y}_c$, which would be required to express condition (10). For this reason, we consider only targeted adversarial search in the experiments with Marabou.
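For illustration, encoding (11) can be set up through Marabou's Python bindings roughly as follows; this is a sketch assuming the maraboupy API (function names and return formats vary across Marabou versions), with a hypothetical network file and placeholder values for $x$, $\epsilon$ and the target class.

import numpy as np
from maraboupy import Marabou

# Hedged sketch of encoding (11) with maraboupy; exact API details vary.
network = Marabou.read_nnet("net1.nnet")            # hypothetical network file
in_vars = np.array(network.inputVars[0]).flatten()
out_vars = np.array(network.outputVars[0]).flatten()  # older versions: network.outputVars

x = np.zeros(784)                  # placeholder input image
eps, c_t = 0.1, 3                  # placeholder bound and target class

for i, v in enumerate(in_vars):    # x_i - eps <= x~_i <= x_i + eps
    network.setLowerBound(v, x[i] - eps)
    network.setUpperBound(v, x[i] + eps)
for j, v in enumerate(out_vars):   # y~_{c_t} >= y~_j for all j != c_t
    if j != c_t:
        # addInequality encodes sum(coeffs * vars) <= scalar: y~_j - y~_{c_t} <= 0
        network.addInequality([v, out_vars[c_t]], [1.0, -1.0], 0.0)

result = network.solve()           # SAT: a targeted adversarial example exists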

ERAN provides untargeted adversarial example search only. The encoding of the property and the network is performed automatically by the tool, given the network $\nu$ and an image $x$. In particular, ERAN uses an abstract interpretation approach to compute a symbolic overapproximation $\hat{O}$ of the concrete outputs of the neural network $\nu$, given a symbolic overapproximation $\hat{I}$ of the set of possible inputs. The overapproximation is sound: $\nu(x, \theta) \in \hat{O}$, i.e., all concrete outputs are accounted for, as long as the concrete input $x$ is contained within the boundaries of the input overapproximation $\hat{I}$. In the case of bounded adversarial search, $\hat{I}$ contains all the vectors $\tilde{x}$ such that $\|\tilde{x} - x\|_\infty \le \epsilon$: the corresponding output approximation may contain values for which the likelihood of the correct label is smaller than that of some other label. ERAN features two versions: an incomplete one, where the output approximation is inspected to check whether the network is safe with respect to adversarial perturbations; and a complete one, where ERAN takes one more step and calls a MILP solver in order to find and output a concrete adversarial example. More details about ERAN can be found in [DBLP:conf/iclr/SinghGPV19].
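ERAN's abstract domains are considerably more precise, but the principle of sound output overapproximation can be illustrated with plain interval (box) propagation; the sketch below is ours, for illustration only, and does not reproduce ERAN's actual analysis.

import torch

# Sound box propagation: given lower/upper bounds on a layer's input, compute
# bounds that contain every possible output (a coarse stand-in for ERAN's
# abstract domains).
def interval_linear(W, b, lo, hi):
    W_pos, W_neg = W.clamp(min=0), W.clamp(max=0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    return lo.clamp(min=0), hi.clamp(min=0)

# If, after propagating [x - eps, x + eps] through the whole network, the
# lower bound of the correct class exceeds the upper bound of every other
# class, no eps-bounded adversarial example exists (sound but incomplete).
def certifies_robustness(out_lo, out_hi, c):
    return all(out_lo[c] > out_hi[j] for j in range(len(out_lo)) if j != c)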

MIPVerify can search for both targeted and untargeted adversarial examples, and it also automatically generates its internal encoding given the network of interest and a starting input image. In particular, MIPVerify encodes the search for an untargeted adversarial example as the following MILP problem:

$\min_{\tilde{x}} \ \|\tilde{x} - x\|_\infty \quad \text{subject to} \quad \arg\max_{i} \nu(\tilde{x}, \theta)_i \neq c$ (12)

As can be seen from (12), MIPVerify searches for the adversarial example which is closest to the original image. The encoding of the network is based on the formulation of piece-wise linear functions as conjunctions of MILP constraints. For an in-depth explanation of the complete encoding we refer to [DBLP:conf/iclr/TjengXT19].

5 Experiments

5.1 Setup

We consider two different baseline architectures: both of them consist of an input layer with 784 neurons and an output layer with 10 neurons, and both feature three fully connected hidden layers; the two architectures differ in the widths of these layers. We will identify them as NET1 and NET2, respectively. Each hidden layer is followed by a batch normalization layer, and activation functions are Rectified Linear Units (ReLUs). Networks are trained using an SGD optimizer with Nesterov momentum [DBLP:conf/icml/SutskeverMDH13]. The learning parameters are:

  • learning rate, defining how quickly the model replaces the concepts it has learned with new ones, i.e., controlling how much the optimizer can modify the network weights at each iteration;

  • momentum, explained in [DBLP:conf/icml/SutskeverMDH13], to control the gradient evolution in the attempt to escape local minima;

  • weight decay, i.e., the magnitude of the regularization;

  • batch size, defining the number of samples considered before updating the model.

We also use a learning rate scheduler which multiplies the learning rate by a constant factor whenever the loss stops decreasing for a fixed number of consecutive epochs. Learning proceeds for a fixed maximum number of epochs, unless the loss stops decreasing for a given number of consecutive epochs, in which case it terminates early. All the learning and pruning algorithms we present are implemented using the learning framework PyTorch [paszke2017automatic], with GPU-intensive computation running on top of the Google Colaboratory service.

[Table 1: Sets of parameters considered in our experiments, one row per parameter set (SET1, SET2, SET3) with columns Neuron SR, Weight SR and Regularizer. Param represents the identifiers of the different parameter sets. Neuron SR and Weight SR represent the sparsity rate (SR) for neuron pruning and weight pruning, respectively. Regularizer represents the regularizer value for sparse training in neuron pruning. The Neuron SR for SET3 differs between NET1 and NET2.]

Table 1 shows the three different sets of pruning parameters considered in our experiments. Neuron SR and Weight SR are the sparsity rates supplied to the neuron pruning and weight pruning procedures. We test three different sets of parameters to analyze how the performance of the verification tools changes with respect to different amounts of pruning, with SET1 being the most aggressive of the three. Our implementation of weight pruning is based on the code described in [DBLP:conf/iclr/LiuSZHD19], and our implementation of neuron pruning is based on the code described in [DBLP:conf/iccv/LiuLSHYZ17]. The setup presented is used for the networks trained on MNIST as well as FMNIST.

For each baseline network and dataset we test the following:

  • Baseline Network: this is the network trained using the standard training method, without the sparsity regularizer $R_s$.

  • Sparse Network: this is the network trained using the sparse training method, with the regularizer $R_s$.

  • NP Network: this is the network trained with Algorithm 1 presented in Section 3.

  • WP Network: this is the network trained with Algorithm 2 presented in Section 3.

For each such network we compute the accuracy with respect to the original test set and the robustness using the tool Foolbox [DBLP:journals/corr/RauberBB17]. In particular, we use the bounded adversarial attack known as the Gradient Sign Method (GSM) [DBLP:journals/corr/GoodfellowSS14] to compute adversarial examples. We use this method to assemble twenty images on which Foolbox behaves consistently across all the variants of the network of interest, i.e., for each image an adversarial example is found for all of the Baseline, Sparse, NP and WP networks, or for none of them — notice that the latter does not mean that an adversarial example does not exist, because GSM is sound but not complete with respect to the set of all bounded adversarial examples of a given network. The images resulting from this process are the ones fed to the verification tools in order to test their performance. For all our experiments, we consider a fixed bound $\epsilon$ on the infinity norm. The tools are tested with a timeout of 600 CPU seconds for each adversarial example, running ERAN in its complete verification version.
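As an illustration, the robustness measurement with Foolbox can be sketched as follows; this assumes the Foolbox 3.x API (the exact version and settings are not specified above) and uses placeholder data and a placeholder model, with FGSM playing the role of the Gradient Sign Method.

import torch
import foolbox as fb

# Hedged sketch of the GSM-based robustness check (Foolbox 3.x API assumed).
model = torch.nn.Sequential(torch.nn.Linear(784, 10))  # placeholder classifier
fmodel = fb.PyTorchModel(model.eval(), bounds=(0.0, 1.0))
attack = fb.attacks.FGSM()

images = torch.rand(20, 784)            # placeholder images in [0, 1]
labels = torch.randint(0, 10, (20,))    # placeholder correct classes

_, advs, success = attack(fmodel, images, labels, epsilons=0.1)
robustness = 1.0 - success.float().mean().item()  # fraction with no adversarial found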

One last remark about the setup concerns how we run the tools on different problems. For Marabou, we test the images for which an adversarial example was found by searching for an adversarial example of the same class found by Foolbox, whereas we test only one image for which the adversarial example was not found, considering all the possible classes, one by one, as targets for the adversarial example. Because of this, we consider 19 instances of the adversarial search for Marabou, as opposed to the 20 instances considered for the other tools. For ERAN and MIPVerify we consider untargeted adversarial example search only.

5.2 Results

[Table 2: Accuracy and Robustness of the networks of interest for the MNIST dataset, with one row per combination of Base (NET1, NET2), Param (SET1, SET2, SET3) and Network (Baseline, Sparse, WP, NP). See Table 1 for the meaning of Param. The column Network represents the different networks we have considered in our experiments. The column Accuracy represents the accuracy of the networks computed on the test set. Robustness represents the robustness of the networks computed using Foolbox. Base represents which baseline architecture was used.]
[Table 3: Accuracy and Robustness of the networks of interest for the FMNIST dataset, with the same row structure as Table 2. See Table 2 for the meaning of the columns.]
[Table 4: Number of hidden neurons and of weights of the networks of interest, with one row per combination of Base, Param and Network. These values are valid for both the MNIST and FMNIST datasets. For the meaning of the columns Base, Param and Network see Table 2. # HL Neurons represents the number of neurons in the hidden layers of the network; # Weights represents the number of connections in the network.]

Accuracy and robustness

Tables 2 and 3 show the results of our analysis of the accuracy and robustness of the networks trained on MNIST and FMNIST, respectively. The purpose of these data is to show that, after our training pipeline, the robustness and accuracy of our networks do not differ from those of the baseline architectures in a substantial way, which makes the approach of running verification to certify pruned networks useful in practice. Should our networks exhibit unsatisfactory accuracy or robustness rates, it would not be interesting to subject them to further analysis. As can be observed, both the weight-pruned networks and the neuron-pruned ones present a slight decrease in accuracy which correlates positively with the strength of pruning, i.e., the larger the sparsity rate, the higher the loss in accuracy. The results related to robustness are less clear-cut: on MNIST, robustness decreases in neuron-pruned networks but increases in weight-pruned ones; on the FMNIST dataset the opposite appears to be true. Given this discrepancy, we believe that further investigation of pruning techniques — which is beyond the scope of this paper — might be in order. Here, we just mention that a decrease in robustness has been linked to over-pruning in [DBLP:conf/nips/GuoZZC18], where networks are considered to be over-pruned whenever they are more than 50 times smaller than their baseline; in the same work, more modest levels of pruning are linked with an increase in robustness. To quantify the impact of our procedures on the resulting architectures, we report the number of neurons and weights surviving in the hidden layers of the pruned networks in Table 4. As can be observed, networks pruned with NP present a great number of pruned connections besides the expected pruned neurons, but this still does not explain the discrepancy observed in robustness. Another explanation could be that the fine-tuning performed in our pruning procedures is not oriented towards achieving robustness — which could be done using techniques such as robust training [DBLP:conf/nips/WongSMK18]. In this case, the robustness of the resulting networks could be improved by using the proper training procedures, and the overall results could be made more predictable than in our current setting.

[Table 5: Results of our experiments for Marabou, ERAN and MIPVerify on MNIST and FMNIST, with one row per combination of Base (NET1, NET2), Param (SET1, SET2, SET3) and Network (Baseline, Sparse, WP, NP). The values reported represent the number of problems which were solved successfully within the timeout. See Table 3 for the meaning of the columns Param, Network and Base. The columns Marabou, MIPVerify and ERAN report the number of problems solved by each tool.]

Impact on verification

Table 5 reports the results of our experiments with the verification tools we consider. In particular, we report the number of problems, i.e., adversarial searches, that each tool managed to complete successfully with respect to the different datasets, sets of parameters and sets of images. It should be noted that the maximum number of solvable problems is always 20, except for Marabou, where the number of problems is 19. As can be observed, the general trend for all the tools is that neuron-pruned networks are verified more easily than the other configurations, in particular when the pruning is more aggressive. However, there are some exceptions to this trend. For instance, when NET2 networks are strongly pruned (i.e., NET2 and SET1), MIPVerify is able to solve more instances on the sparse networks than on the neuron-pruned ones.

Weight pruning appears to be consistently less effective than neuron pruning and, in the case of strongly pruned NET2 networks, even less effective than sparse training alone. It should be noted that for the SET1 level of pruning, Marabou and MIPVerify present unexpected behaviours: in some cases Marabou terminates immediately, reporting that the verification query is not satisfiable, whereas MIPVerify returns an array of Not-a-Number values. The experiments in which the tools were subject to the above-mentioned problems are flagged in the table beside the number of instances verified (the instances in which the problem arose were counted as timed out). We believe these behaviours might be the result of numerical instabilities caused by matrix sparsity. In the case of MIPVerify, this hypothesis has been confirmed by its developers: Gurobi, the underlying solver used by MIPVerify, is affected by the bad conditioning of the sparse matrices our procedures produce.

In Table 5 we do not report runtimes on the solved instances, but considering them reveals that Marabou and MIPVerify have comparable performance: Marabou is better than MIPVerify on FMNIST but worse on MNIST. On the other hand, ERAN is consistently faster than both Marabou and MIPVerify. As we expected, all the tools appear to be, on average, faster on networks which are pruned more aggressively.

Effects of WP and NP on complete methods

We now turn our attention to the different effects that NP and WP have on the performance of the verification tools. We postulate that the observed differences can be explained by considering the techniques at the core of the verification algorithms. Indeed, all three tools resort to constraint-based methodologies (SMT and MILP) whose performance is sensitive to the number of variables and constraints in the encoding of the original problem. The compilation of the non-linear activation functions requires a (potentially large) number of variables and constraints to represent them, and thus reducing neurons has a direct impact on the sheer complexity of the encoding. On the other hand, weight pruning eliminates mostly arithmetical operations, but leaves the number of variables and constraints mostly unchanged. To support this hypothesis we have performed some additional experimental analysis on the version of ERAN relying on abstract interpretation only — the results are presented in Table 6. In this “incomplete” version, ERAN does not make use of a MILP solver to find the adversarial example, but just analyses the robustness of the network through sound overapproximations. ERAN-incomplete is able to manage networks more efficiently than Marabou and MIPVerify, but abstract adversarial examples may fail to map to concrete ones. As can be seen in Table 6, ERAN-incomplete manages to verify weight-pruned networks in less time than neuron-pruned ones, and the speed-up appears to be directly related to the number of weights pruned.

Regarding the results obtained for the heavily pruned sparse networks, we believe that MIPVerify and ERAN manage to leverage the presence of many connection weights set to zero in order to remove non-linearities from their MILP problems. In particular, whenever a non-linear activation function has only zero-valued connections in input or in output, it can be removed, since it would not influence the network in any way. To confirm this hypothesis we have analysed the log files of Gurobi when called by MIPVerify on the weight-pruned, neuron-pruned and sparse networks: as we expected, the problems solved by Gurobi in the case of neuron-pruned and sparse networks contain substantially fewer integer variables than the problems solved on the weight-pruned networks.

ERAN - incomplete

[Table 6: Results of our experiments for the MNIST and FMNIST datasets on the incomplete version of ERAN, with one row per combination of Base (NET1, NET2), Param (SET1, SET2, SET3) and Network (Baseline, Sparse, WP, NP). See Table 3 for the meaning of the columns Param and Network. The columns MNIST and FMNIST report the average time needed to solve a single adversarial search on the MNIST and FMNIST datasets, respectively. The averages are computed over 30 different images.]

6 Conclusions and Future Work

One of the main challenges in the verification of (deep) neural networks is the scalability of the currently available tools. While a great deal of research effort has been put into developing new and more scalable verification tools, leveraging statistical methods from the machine learning community to ease formal analysis has been mostly ignored. In this paper we have provided evidence that the interaction between pruning methods and verification tools can be effective and can enable formal analysis of networks that could not be checked otherwise. We have studied how two different pruning methods can be embedded in a training pipeline to produce networks with good accuracy and robustness, while lowering the complexity of the corresponding verification problem. As future research, we plan to extend the work presented here to convolutional neural networks, considering also other pruning methods, e.g., filter pruning, and other verification tools. Moreover, we intend to investigate the robustness properties of the pruned networks, both using different pruning methods and testing them with robust training and other analogous techniques.

References