Identifying Critical Neurons in ANN Architectures using Mixed Integer Programming

by Mostafa Elaraby, et al.
Université de Montréal

We introduce a novel approach to optimize the architecture of deep neural networks by identifying critical neurons and removing non-critical ones. The proposed approach utilizes a mixed integer programming (MIP) formulation of neural models which includes a continuous importance score computed for each neuron in the network. The MIP solver minimizes the number of critical neurons (i.e., those with a high importance score) that need to be kept to maintain the overall accuracy of the model. Further, the proposed formulation generalizes the recently considered lottery ticket optimization by identifying multiple "lucky" sub-networks, resulting in optimized architectures that not only perform well on a single dataset, but also generalize across multiple ones upon retraining of the network weights. Finally, the proposed framework provides a significant improvement in the scalability of automatic sparsification of deep network architectures compared to previous attempts. We validate the performance and generalizability of our approach on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, using three different neural networks: LeNet-5 and two ReLU fully connected models.



1 Introduction

Figure 1: The generic flow of our proposed framework, used to remove neurons having an importance score below a certain threshold.

Deep learning has proven its ability to solve complex tasks and to achieve state-of-the-art results in various domains such as image classification, speech recognition, machine translation, robotics and control (Bengio et al., 2017; LeCun et al., 2015). Over-parameterized deep neural models with more parameters than training samples can be used to achieve state-of-the-art results on various tasks (Zhang et al., 2016; Neyshabur et al., 2018). However, the large number of parameters comes at the expense of computational cost in terms of memory footprint, training time and inference time on resource-limited IoT devices (Lane et al., 2015; Li et al., 2018).

In this context, pruning neurons from an over-parameterized neural model has been an active research area. This remains a challenging open problem whose solution has the potential to increase computational efficiency and to uncover potential sub-networks that can be trained effectively. Neural Network pruning techniques (LeCun et al., 1990; Hassibi et al., 1993; Han et al., 2015; Srinivas and Babu, 2015; Dong et al., 2017; Zeng and Urtasun, 2018; Lee et al., 2018; Wang et al., 2019; Salama et al., 2019; Serra et al., 2020) have been introduced to sparsify the model without loss of accuracy.

Most existing work focuses on identifying redundant parameters and non-critical neurons to achieve a lossless sparsification of the neural model. The typical sparsification procedure consists of training a neural model, computing the importance of its parameters, pruning parameters according to certain criteria, and fine-tuning the neural model to regain its lost accuracy. Furthermore, existing pruning and ranking procedures are computationally expensive and require iterations of fine-tuning on the sparsified model, and no experiments were conducted to check the generalization of sparsified models across different datasets.

We remark that sparse neuron connectivity is often used by modern network architectures, perhaps most notably in the convolutional layers used in image processing. Indeed, the limited size of the parameter space in such cases increases the effectiveness of network training and enables learning of meaningful semantic features from the input images (Goodfellow et al., 2016). Inspired by the benefits of sparsity in such architecture designs, we aim to leverage the neuron sparsity achieved by our framework to obtain optimized neural architectures that generalize well across different datasets. For this purpose, we create a sparse sub-network by optimizing on one dataset and then train the same (masked) architecture on another dataset. This indicates a promising direction of future research into using our approach for effective automatic architecture tuning, to augment handcrafted network architecture design.


In our proposed framework, illustrated in Figure 1, we formalize the notion of neuron importance as a score between 0 and 1 for a trained neural network. The neuron importance score reflects how much the activity of a neuron can be decreased while controlling the loss in accuracy of the neural network model. Concretely, we propose a mixed integer programming (MIP) formulation that allows the computation of each neuron's score on a fully connected layer and that takes into account the error propagation between the different layers. The representation of trained ReLU neural networks through MIPs has previously been proposed by, e.g., Fischetti and Jo (2018) and Anderson et al. (2019) to check neural model robustness. In our framework, we further advance these MIP formulations by introducing new variables quantifying each neuron's importance score. The motivation for this approach comes from the existence of powerful techniques to solve MIPs efficiently in practice and, consequently, the scalability of this procedure to large ReLU neural models. In addition, the proposed formulation can be extended from fully connected layers to convolutional layers when these are converted to Toeplitz format (Gray, 2000).

Our MIP formulation uses a small balanced subset of data points (as small as one per class) to approximate the importance of the neurons. We perform computational tests to validate the robustness of this approximation by choosing such subsets of data points at random. Once the neuron importance scores have been determined, a threshold is established to identify non-critical neurons and remove them. Even without fine-tuning, our experiments show only a marginal loss in model accuracy, whereas removing critical neurons, or even a random selection of neurons (independent of scores), causes a significant drop in accuracy.

To add to our contribution, we show that the computed neuron importance score from a specific dataset generalizes to other datasets by retraining on the pruned (sub) network, using the same initialization.

To validate our methodology, we evaluate our approach on simple ReLU neural models trained on MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky and Hinton, 2010). The evaluation includes computing the importance score of each neuron on each dataset through a MIP. Afterwards, we create different sub-networks, each with a different pruning method. We compare the performance of sub-networks obtained by pruning non-critical neurons, randomly chosen neurons and critical neurons against the reference neural model. To make the comparison fair, the number of neurons pruned at each layer is the same as the number pruned by our framework. Another evaluation concerns the existence of sub-networks with the same initialization (lottery tickets) that generalize to different datasets. First, we compute the neuron ranking of a neural model on a dataset $d_1$. Then, we take the same neural model initialization, together with the masked neurons computed on dataset $d_1$, and retrain on a dataset $d_2$. These experiments show that the calculated ranking correctly orders the neurons and that it can generalize across different datasets.

Organization of the paper.

Next, in Section 1.1, we review relevant literature on neural network sparsification. Section 1.2 provides background on the formulation of ReLU neural networks as MIPs. In Section 2, we introduce the neuron importance score and its incorporation in the mathematical programming model, while Section 3 discusses the objective function that optimizes sparsification. Section 4 provides computational experiments, and Section 5 discusses and summarizes our findings.

1.1 Related Work

Classical weight pruning methods.

LeCun et al. (1990) proposed optimal brain damage, which prunes weights having a small saliency, computed from the second derivatives of the objective. The objective being optimized combined the model's complexity and the training error. Hassibi and Stork (1993) introduced the optimal brain surgeon, which aims at removing non-critical weights determined through the computation of the Hessian. Another approach is presented by Chauvin (1989) and Weigend et al. (1991), where a penalizing term (e.g., an L0 or L1 norm) is added to the loss function during model training as a regularizer. The model is sparsified during back-propagation of the loss function. Since these classical methods i) depend on the scale of the weights, ii) are incorporated during the learning process, and, for some of them, iii) rely on computing the Hessian with respect to some objective, they turn out to be slow, requiring iterations of pruning and fine-tuning to avoid loss of accuracy. Our proposed approach identifies a set of non-critical weights that, when pruned together, results in a marginal loss of accuracy without the need for fine-tuning or iterations.

General weight pruning methods.

Molchanov et al. (2016) proposed a greedy criteria-based pruning with fine-tuning by back-propagation. The proposed criterion is the absolute difference between the losses of the dense and sparse neural models (ranker). This cost function ensures that pruning will not significantly decrease the performance of the model. The drawback of this approach is that it requires retraining after pruning each layer. Shrikumar et al. (2017) developed a framework that computes the importance of the neurons at each layer through a single backward pass. This technique compares the activation values among the neurons and assigns a contribution score to each of them based on each input data point. Other related techniques, using different objectives and interpretations of neuron importance, have been presented (Berglund et al., 2015; Barros and Weber, 2018; Liu et al., 2018; Yu et al., 2018; Hooker et al., 2019). They all demand intensive computation, while our approach aims at determining neuron importance efficiently.

Lee et al. (2018) investigate the pruning of connections, instead of entire neurons. The sensitivity of each connection is studied through the model's initialization and a batch of input data. Connections with a sensitivity lower than a certain threshold are removed. Our method generalizes the computed neuron importance across different datasets, and it creates a sparsified neural model from a pre-trained model without fine-tuning.

Structured weight filter pruning methods.

Srinivas and Babu (2015) proposed a data-free way of pruning neural models by computing the correlation between neurons. This correlation is computed using a saliency of two weight sets for all possible sets in the model. The saliency set computation is computationally intensive. Li et al. (2016), Molchanov et al. (2016) and Jordao et al. (2018) focus on methods to prune entire filters in convolutional neural networks, which include pruning and retraining to regain lost accuracy.

He et al. (2018) proposed an approach to rank entire filters in convolutional networks. The ranking method uses the L1 norm of the weights of each filter. This approach is effective at detecting non-critical filters. However, it involves alternating between pruning the lowest-ranking filters and retraining, in order to avoid a drop in accuracy. Our work focuses on non-structured weight compression, and it can be applied to convolutional filters in Toeplitz fully connected format.

Lottery ticket.

Frankle and Carbin (2018) introduced the lottery ticket hypothesis, which states the existence of a lucky pruned sub-network, a winning ticket. This winning ticket can be trained effectively with fewer parameters, while incurring only a marginal loss in accuracy. Morcos et al. (2019) proposed a technique for sparsifying an over-parameterized trained neural model based on the lottery ticket hypothesis. Their technique involves pruning the model and disabling some of its sub-networks. The pruned model can be fine-tuned on a different dataset, achieving good results. To this end, the dataset used in the pruning phase needs to be large. The lucky sub-network is found by iteratively pruning the lowest-magnitude weights and retraining.

Representing ANN using MIP.

Fischetti and Jo (2018) and Anderson et al. (2019) formulate a ReLU ANN as a mixed integer program. Tjeng et al. (2017) use mixed integer programming to evaluate the robustness of neural models against adversarial attacks. In their proposed technique, they assess the trained neural model's sensitivity to perturbations in input images. Serra et al. (2020) also use mixed integer programming to maximize the compression of an existing neural network without any loss of accuracy. Different ways of compressing (removing neurons, folding layers, etc.) are presented. However, the reported computational experiments lead only to the removal of inactive neurons. Our method can identify such neurons, and can additionally identify other units whose removal would not significantly compromise accuracy.

1.2 Background and Preliminaries

Integer programs are combinatorial optimization problems restricted to discrete decision variables, linear constraints and a linear objective function. These problems are NP-complete, even when the variables are restricted to binary values (Garey and Johnson, 1979). The difficulty comes from ensuring integer solutions, and thus from the impossibility of using gradient methods. When continuous variables are included, the problems are called mixed integer programs. Advances in combinatorial optimization, such as branching techniques, bounds tightening, valid inequalities, decompositions and heuristics, to name a few, have resulted in powerful solvers that can in practice solve MIPs of large size in seconds. We refer the reader to Nemhauser and Wolsey (1988) for an introduction to integer programming.


Considering the $l$-th layer of a trained ReLU neural network, $W^l$ denotes the weight matrix and $b^l$ the bias vector. Let $N^l$ be the set of neurons of layer $l$, and let $h^l$ be a decision vector containing the neuron values of layer $l$ for each input data point $x$, i.e., $h^{l+1} = \mathrm{ReLU}(W^l h^l + b^l)$ for $l \ge 0$, and $h^0 = x$. Let $z^{l+1}$ be a binary vector whose $i$-th entry takes value 1 if the $i$-th entry of $W^l h^l + b^l$ is positive and 0 otherwise. Finally, let $L^{l+1}$ and $U^{l+1}$ be parameter vectors indicating valid lower and upper bounds for the value of each neuron in layer $l+1$. We discuss the computation of these bounds in Section 2.3. For now, we assume that $L^{l+1}$ and $U^{l+1}$ are a sufficiently small and a sufficiently large number, respectively.

Next, we provide the representation of ReLU neural networks of Fischetti and Jo (2018) and Anderson et al. (2019). For the sake of simplicity, we describe the formulation for one layer of the model and one input data point; these constraints must be repeated for each layer and each input data point:

$$
\begin{aligned}
h^0 &= x && \text{(1a)}\\
h^{l+1} &\ge 0 && \text{(1b)}\\
h^{l+1} &\ge W^l h^l + b^l && \text{(1c)}\\
h^{l+1} &\le U^{l+1} \odot z^{l+1} && \text{(1d)}\\
h^{l+1} &\le W^l h^l + b^l - L^{l+1} \odot (1 - z^{l+1}) && \text{(1e)}\\
z^{l+1} &\in \{0, 1\}^{|N^{l+1}|} && \text{(1f)}
\end{aligned}
$$

In (1a), the initial decision vector $h^0$ is forced to be equal to the input $x$ of the first layer. When an entry of $z^{l+1}$ is 0, constraints (1b) and (1d) force the corresponding entry of $h^{l+1}$ to be zero, reflecting a non-active neuron. If an entry of $z^{l+1}$ is 1, then constraints (1c) and (1e) force the corresponding entry of $h^{l+1}$ to be equal to that of $W^l h^l + b^l$. We refer the reader to Fischetti and Jo (2018) and Anderson et al. (2019) for details.
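The gating behaviour described for constraints (1b) to (1e) can be checked numerically for a single neuron. The following is a minimal sketch, not the authors' code: the function names and the sample bounds are hypothetical, and it assumes the standard big-M form h >= 0, h >= a, h <= U z, h <= a - L (1 - z) for a pre-activation a with L <= a <= U.

```python
def relu_bigm_feasible(a, h, z, lower, upper, tol=1e-9):
    """Return True if (h, z) satisfies the big-M ReLU constraints for pre-activation a."""
    return (
        h >= -tol                            # (1b) output is non-negative
        and h >= a - tol                     # (1c) output is at least the pre-activation
        and h <= upper * z + tol             # (1d) gate z = 0 forces h = 0
        and h <= a - lower * (1 - z) + tol   # (1e) gate z = 1 forces h <= a
    )


def feasible_outputs(a, lower, upper, grid):
    """List every (h, z) on a grid of candidate outputs that satisfies the constraints."""
    return [(h, z) for z in (0, 1) for h in grid if relu_bigm_feasible(a, h, z, lower, upper)]


# Assumed bounds L = -2 <= a <= U = 2, and candidate outputs on a 0.5-spaced grid.
grid = [i * 0.5 for i in range(-4, 5)]
active = feasible_outputs(1.5, -2.0, 2.0, grid)     # positive pre-activation
inactive = feasible_outputs(-1.0, -2.0, 2.0, grid)  # negative pre-activation
```

On this grid the only feasible pair for a positive pre-activation has the gate open and the output equal to the pre-activation, and the only feasible pair for a negative pre-activation has the gate closed and the output at zero, which is the ReLU behaviour the constraints are meant to encode.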

2 MIP Constraints

In what follows, we adapt the previously introduced MIP formulation to quantify neuron importance. In particular, we aim to compute it for all layers of the model in an integrated fashion. If we instead solved a separate MIP for each layer, with constraints only on the input to that layer, its bounds and its output, the solver would select some neurons as important under the assumption that the input from the previous layer is unmasked. By adapting the MIP (1), the global importance of each neuron can be accounted for. Computing neuron importance globally, instead of layer by layer, gives better results in terms of predictive accuracy, as shown in Yu et al. (2018).

To achieve our goal, we add a new continuous vector $s^l$ that determines the importance score of each neuron in layer $l$.

2.1 Linear Fully Connected Layers

We start by presenting the constraints used in a MIP formulation for linear fully connected layers. In neural models, ReLU activation functions are mostly used after each layer. This type of activation function is used to increase the model capacity and to introduce nonlinearity between its layers. However, we start by explaining the introduced neuron importance score variable $s^l$ in its simplest setup, on linear layers. To this end, we consider the following set of constraints:


where the output of layer $l$ is denoted by the decision vector $h^{l+1}$.

A neuron with a non-critical score ends up with a negative decision variable, because the upper bound of its input logit is subtracted in (2a). The negative decision variable of a non-critical neuron is then disabled and set to zero by the ReLU's gating decision variable. Therefore, neurons with an importance score near zero are considered non-critical under our model and, under the optimized model, they can be safely sparsified without losing accuracy. Conversely, neurons with an importance score close to one model a case where their activation is critical to the operation of the neural network. Constraint (2b) ensures that neuron scores lie between 0 and 1, and thus quantify this notion of importance.
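One form of constraints (2a) and (2b) that is consistent with this description is sketched below. It is an assumption reconstructed from the prose, not a quotation of the original constraints: the symbols $s^{l+1}$ (score vector of the output neurons), $U^{l+1}$ (upper bound on the logits $W^l h^l + b^l$) and the elementwise product $\odot$ are chosen here for illustration.

```latex
\begin{align}
h^{l+1} &= W^{l} h^{l} + b^{l} - \left(1 - s^{l+1}\right) \odot U^{l+1}, \\
0 &\le s^{l+1} \le 1 .
\end{align}
```

Under this sketch, a score of zero subtracts the full upper bound, so the decision variable cannot be positive, while a score of one leaves the logit unchanged.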

2.2 ReLU Fully Connected Layers

In ReLU-activated fully connected layers, we use the previously introduced binary vector $z^l$ to act as a gate for the ReLU function. Using $z^l$ along with the continuous neuron importance score $s^l$ at layer $l$ allows the MIP to give a score to each neuron. In this way, we obtain the following MIP:


We extend the ReLU constraints (1) by adding the continuous neuron importance vector $s^l$ to constraints (1c) and (1e).

In (3), setting the score of a neuron to zero would zero out, through the ReLU, the output of the next layer at the index of that neuron. Constraints (3a) and (3b) contain the gating decision variable $z^l$, which chooses whether or not to enable the logit output, along with the neuron score $s^l$. Subtracting the upper bound weighted by the criticality score of the neurons in (3a) and (3b) gives the MIP the freedom to reduce or increase the weight of each neuron in the fully connected setup (or in convolutional layers converted to a Toeplitz matrix (Gray, 2000)).

Constraint (3c) enforces a neuron score in the range $[0, 1]$. Relaxing constraint (1f), so that the gating variable becomes a continuous variable in $[0, 1]$, approximates the ReLU activation value with a marginal gap (Anderson et al., 2019); the gap is the distance between the best upper bound and the best lower bound computed for the problem at hand, so optimal solutions have zero gap. Furthermore, the relaxation cuts the solver's running time from minutes to seconds. Another way of speeding up the solver is to use tight bounds on the decision variables, which reduces the search space.
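As a hedged illustration of how a score can enter the gated constraints, the sketch below applies the big-M ReLU encoding to a score-shifted logit. It is an assumption built from the description above rather than the original system (3): the symbols $\tilde{a}^{l+1}$, $s^{l+1}$ and $\tilde{L}^{l+1}$ are introduced here, with $\tilde{L}^{l+1} = L^{l+1} - U^{l+1}$ used as a valid lower bound of the shifted logit when $U^{l+1} \ge 0$.

```latex
\begin{align}
\tilde{a}^{l+1} &= W^{l} h^{l} + b^{l} - \left(1 - s^{l+1}\right) \odot U^{l+1}, \\
h^{l+1} &\ge 0, \qquad h^{l+1} \ge \tilde{a}^{l+1}, \\
h^{l+1} &\le U^{l+1} \odot z^{l+1}, \qquad
h^{l+1} \le \tilde{a}^{l+1} - \tilde{L}^{l+1} \odot \left(1 - z^{l+1}\right), \\
0 &\le s^{l+1} \le 1, \qquad z^{l+1} \in \{0, 1\}^{|N^{l+1}|}
\ \text{ or, relaxed, } z^{l+1} \in [0, 1]^{|N^{l+1}|} .
\end{align}
```

In this sketch a score of zero pushes the shifted logit to a non-positive value, so the gate closes and the neuron outputs zero, while a score of one recovers the plain ReLU constraints.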

2.3 Bounds Propagation

In the previous MIP formulation, we assumed a large upper bound $U^l$ and a small lower bound $L^l$. However, using large bounds may lead to long computational times. To overcome this issue, we fix an upper and a lower bound at the input layer for each input point $x$, with a very small difference between them.


Propagating the initial bounds of the input images through the trained model creates upper and lower bounds for the input point $x$ in each layer.

In our formulation, we use tight lower and upper bounds to keep the approximation of the trained neural model under control when the gating variable is relaxed. With tighter bounds, the MIP has a narrower space of feasible solutions satisfying the constraints (3).
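Bound propagation of this kind is commonly implemented with interval arithmetic. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function name `propagate_bounds`, the tolerance `eps` around the input point and the tiny network are all hypothetical.

```python
def propagate_bounds(weights, biases, lower, upper):
    """Propagate elementwise input bounds through affine + ReLU layers.

    Returns one (pre_lower, pre_upper) pair per layer, bounding the
    pre-activation W h + b of that layer.
    """
    bounds = []
    for W, b in zip(weights, biases):
        pre_lower, pre_upper = [], []
        for row, bias in zip(W, b):
            # A positive weight takes the matching bound, a negative weight the opposite one.
            lo = bias + sum(w * (lower[j] if w >= 0 else upper[j]) for j, w in enumerate(row))
            hi = bias + sum(w * (upper[j] if w >= 0 else lower[j]) for j, w in enumerate(row))
            pre_lower.append(lo)
            pre_upper.append(hi)
        bounds.append((pre_lower, pre_upper))
        # ReLU is monotone, so it maps the interval endpoints directly.
        lower = [max(0.0, v) for v in pre_lower]
        upper = [max(0.0, v) for v in pre_upper]
    return bounds


# Hypothetical 2-2-1 network and an input point x with an assumed tolerance eps.
x, eps = [1.0, 2.0], 0.5
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.5]
layer_bounds = propagate_bounds(
    [W1, W2], [b1, b2], [v - eps for v in x], [v + eps for v in x]
)
```

The resulting per-layer intervals can serve as the lower and upper bound parameters of the big-M constraints; a smaller `eps` gives tighter intervals and therefore a smaller search space.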

3 MIP Objectives

The objective we set for the proposed framework is to sparsify non-critical neurons without reducing the predictive accuracy of the neural model. To this end, we use two optimization objectives.

Our first optimization objective is to maximize the number of neurons sparsified from the trained neural model. We follow the notation introduced in Section 1.2.

Consider the sum of the neuron importance scores at each layer, scaled down to a fixed range. If one layer has a set of neurons with a given score and another arbitrary layer has a set of neurons with the same score, both layers have the same sum of scores, which would confuse the solver when narrowing down to the optimal solution. Giving a lower value to non-critical neurons and minimizing this sum pushes the solver to sparsify more neurons while satisfying the constraints.

In order to create a relation between the neuron scores of different layers, our objective becomes maximizing the number of neurons sparsified from layers with a higher sum of scores. Hence, we formulate the sparsity loss as


Here, the objective is to maximize the number of non-critical neurons at each layer compared to other layers in the trained neural model. The sparsity quantification is then normalized by the total number of neurons.

Our second optimization objective is to minimize the loss of (important) information due to the sparsification of the trained neural model. Further, we aim for this minimization to be done without relying on the values of the logits. These are closely correlated with the neurons pruned at each layer, so relying on them would drive the MIP to simply give a full score of 1 to all neurons in order to keep the same output logit values. Instead, we formulate this optimization objective using the marginal softmax proposed by Gimpel and Smith (2010). Using the marginal softmax allows the solver to focus on minimizing the misclassification error without relying on logit values. The marginal softmax loss avoids putting a large weight on the logits coming from the trained neural model and on the predicted logits of the decision vector computed by the MIP. Instead, it optimizes for the label having the highest logit value. Formally, we write the objective


where the index in the formula stands for the class label. This function minimizes the decision variables that model the output of the final layer, computed by the MIP, while keeping their value at the correct label. The marginal softmax objective therefore preserves the correct predictions of the trained model for the input batch of images, given their one-hot encoded labels, without considering the logit values.

Finally, we combine the two objectives to formulate the multiobjective loss

$$\text{loss} = \text{sparsity} + \lambda \cdot \text{softmax} \qquad (7)$$

as a weighted sum of the sparsification regularizer and the marginal softmax, as proposed by Ehrgott (2005).
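The weighted-sum objective can be illustrated numerically. The sketch below is an assumption-laden illustration, not the authors' implementation: it takes the marginal softmax to be the log-sum-exp of the logits minus the logit of the true class, summed over the batch, treats the sparsity term as a given number, and uses the hypothetical names `marginal_softmax`, `combined_loss` and `lam`.

```python
import math


def marginal_softmax(logits, labels):
    """Sum over the batch of log-sum-exp(logits) minus the logit of the true class."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[label]
    return total


def combined_loss(sparsity, logits, labels, lam):
    """Weighted sum of the two objectives; lam weights the classification term."""
    return sparsity + lam * marginal_softmax(logits, labels)


# Hypothetical batch of two points with three classes; labels are class indices.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1]
loss_small = combined_loss(sparsity=-0.5, logits=logits, labels=labels, lam=1.0)
loss_large = combined_loss(sparsity=-0.5, logits=logits, labels=labels, lam=5.0)
```

Because the classification term is non-negative, a larger weight makes any misclassification more expensive relative to the sparsity term, which matches the trade-off between pruning and accuracy discussed in this section.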

Figure 2: Effect of changing the value of $\lambda$ when pruning a LeNet model trained on Fashion-MNIST.

Note that our objective function (7) implicitly uses a Lagrangian relaxation, where $\lambda \ge 0$ is the Lagrange multiplier. In fact, one would like to control the loss of accuracy by imposing the constraint $\text{softmax} \le \epsilon$ for a very small $\epsilon$, or even to avoid any loss via $\epsilon = 0$. However, this would introduce a nonlinear constraint, which would be hard to handle. Thus, for tractability purposes, we follow a Lagrangian relaxation of this constraint and penalize the objective whenever the softmax term is positive. In accordance with the weak (Lagrangian) duality theorem, the objective (7) is always a lower bound to the problem in which we minimize sparsity and impose a bound on the accuracy loss. Furthermore, the problem of finding the best $\lambda$ for the Lagrangian relaxation, formulated as

$$\max_{\lambda \ge 0} \; \min \left\{ \text{sparsity} + \lambda \cdot \text{softmax} \right\} \qquad (8)$$

has the well-known property of being concave, and in our experiments the best $\lambda$ proved easy to determine. (We remark that if the trained model has misclassifications, there is no guarantee that problem (8) is concave.) We note that the value of $\lambda$ introduces a trade-off between pruning more neurons and the predictive capacity of the model. For example, increasing the value of $\lambda$ prunes fewer neurons, as shown in Figure 2, but increases the accuracy on the test set.

4 Experiments

We validate the proposed approach on several simple architectures proposed by LeCun et al. (1998). We first create different sub-networks in which the same number of neurons is removed from each layer, to allow a fair comparison among them. These sub-networks are obtained through different procedures based on the computed neuron importance score: pruning non-critical neurons, critical neurons and randomly chosen neurons. We evaluate the test accuracy of each sub-network to show how meaningful the computed neuron importance score is. We then compute a pruned sub-network on the MNIST dataset and retrain it on both Fashion-MNIST and CIFAR-10 to show the generalization of lucky sub-networks across different datasets (lottery ticket hypothesis). Finally, we compare to SNIP (Lee et al., 2018), which computes the sensitivity of connections and creates a sparsified sub-network based on the input dataset and the model initialization.

4.1 Experimental Setting

All models were trained for 20 epochs using the RMSprop optimizer with a learning rate of 1e-3 for MNIST and Fashion-MNIST. LeNet-5 (LeCun et al., 1998) on CIFAR-10 was trained using the SGD optimizer with a learning rate of 1e-2 for 256 epochs. The hyperparameters were tuned on the accuracy of the validation set.

All images were resized to 32 by 32 and converted to 3 channels to allow pruned model generalization across different datasets. The MIP is fed with a balanced set of images, each representing a class of the classification task. The aim is to avoid that the determined importance score leads to pruning neurons (features) critical to a class for which a small number of images exists in the input set.

The proposed framework, recall Figure 1, computes the importance score of each neuron, and with a small threshold, we start masking non-critical neurons with a score lower than it. The masked neural model is then directly used to infer labels of the test set without significant loss of accuracy. In order to regain the lost accuracy, we applied fine-tuning using a single epoch on the masked model. We used in the MIP objective function (7).
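The masking step can be sketched as follows. This is an illustration only: the function names, the example scores and the threshold value of 0.1 are hypothetical and are not taken from the experiments.

```python
def mask_noncritical(scores, threshold):
    """Per layer, return a 0/1 mask keeping neurons whose score reaches the threshold."""
    return [[1 if s >= threshold else 0 for s in layer] for layer in scores]


def apply_mask(activations, mask):
    """Zero the activations of masked neurons in one layer."""
    return [a * m for a, m in zip(activations, mask)]


# Hypothetical scores for two hidden layers and an assumed threshold of 0.1.
scores = [[0.9, 0.02, 0.5], [0.0, 0.7]]
masks = mask_noncritical(scores, threshold=0.1)
pruned_fraction = 1 - sum(map(sum, masks)) / sum(len(m) for m in masks)
```

Applying the mask at inference time corresponds to using the masked model directly on the test set; the same mask can then be kept fixed during a fine-tuning epoch.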

We used a simple fully connected 3-layer model (FC-3) with 300+100 hidden units from LeCun et al. (1998), and another simple fully connected 4-layer model (FC-4) with 200+100+100 hidden units. In addition, we used the convolutional LeNet-5 (LeCun et al., 1998). Each of these models was trained 5 times with different initializations.

The experiments were performed on an Intel(R) Xeon(R) CPU @ 2.30GHz with 12 GB of RAM and a Tesla K80 GPU, using the Mosek 9.1.11 solver (Mosek, 2010) and PyTorch 1.3.1 (Paszke et al., 2019).

4.2 MIP Input Verification

Figure 3: Effect of changing the validation-set images input to the MIP on pruning percentage and test accuracy, when pruning a LeNet model trained on Fashion-MNIST.

In this experiment, we assess the robustness of our MIP formulation against different batches of input images to the MIP. Namely, we used 25 randomly sampled balanced images from the validation set. The randomly selected images may include images misclassified by the neural model.

As Figure 3 illustrates, changing the input images used by the MIP to compute neuron importance score resulted in marginal changes in the test accuracy between different batches. We remark that the input batches may contain images that were misclassified by the trained neural model. In this case, the MIP tries to use the score to achieve the true label. The marginal fluctuation of pruning percentage and test accuracy shown in Figure 3 between different batches depends on the accuracy of the input batch optimized by the MIP.

4.3 Comparison to Random and Critical Pruning

Table 1: Pruning results on the fully connected FC-3 and FC-4 models and the convolutional LeNet-5 model using different datasets. We compare the test accuracy of the reference model (Ref.), a randomly pruned model (RP.), a model pruned based on critical neurons selected by the MIP (CP.), and our non-critical pruning approach with fine-tuning for 1 epoch (Ours + ft) and without it (Ours). The table also reports the pruning percentage, Prune (%), and the MIP solving time, MIP Time (s), for which the ReLU constraints were relaxed for faster solving with a marginal gap. Columns cover the datasets MNIST, Fashion-MNIST and CIFAR-10 and the models FC-3, FC-4 and LeNet-5.

We start by training a reference model (Ref.) using the training parameters in Section 4.1. After training and evaluating the reference model on the test set, we feed an input batch of images from the validation set to the MIP. The MIP solver computes the neuron importance scores based on the input images fed to it. In our experimental setup, we used a balanced set of images, each representing a class. We pruned any neuron whose score fell below a fixed threshold in all FC models, and used a different threshold for LeNet-5. In convolutional models, the neurons of the fully connected bottleneck of a classifier are critical for a lossless compression. The MIP solver tends to give more neurons a score above the threshold used for the FC models. Thus, to achieve a meaningful pruning percentage, we used a higher threshold.

What happens if we mask critical neurons instead of non-critical neurons?

Would the model's accuracy on the test set drop after masking critical neurons? We test our approach by pruning the neurons with the top scores from each layer. In that case, the experiments in Table 1 show a significant drop in test accuracy compared with the reference model. This significant drop shows that the neuron score retains the critical neurons, which is what allows a close to lossless compression.
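The three pruning strategies being compared can be sketched for a single layer as below. This is a hypothetical illustration: the function name `pruning_sets`, the example scores, the threshold of 0.2 and the random seed are assumptions, and the only property taken from the text is that all three strategies prune the same number of neurons per layer.

```python
import random


def pruning_sets(scores, threshold, seed=0):
    """For one layer, pick equally sized index sets to prune under three strategies."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending score
    count = sum(1 for s in scores if s < threshold)  # neurons the score-based rule prunes
    rng = random.Random(seed)
    return {
        "non_critical": set(order[:count]),                    # lowest scores
        "critical": set(order[len(order) - count:]),           # highest scores
        "random": set(rng.sample(range(len(scores)), count)),  # independent of scores
    }


# Hypothetical scores for a six-neuron layer and an assumed threshold of 0.2.
layer_scores = [0.05, 0.9, 0.1, 0.6, 0.0, 0.8]
sets = pruning_sets(layer_scores, threshold=0.2)
```

Keeping the count identical across the three sets isolates the effect of which neurons are removed from the effect of how many are removed.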

What happens if we fine-tune our pruned model for just 1 epoch?

We used this experiment to show that, in some cases, after fine-tuning for just 1 epoch the model's accuracy can surpass that of the reference model. Since the MIP solves its marginal softmax objective on the true labels, the generated sub-network, after fine-tuning, can surpass the reference model.

4.4 Generalization Between Different Datasets

In this experiment, we show that our proposed approach of computing neuron importance generalizes across different datasets using the same model initialization. We train the model on a dataset $d_1$ and create a masked neural model using our approach. After creating the masked model, we reset the model to its original initialization. Finally, the new masked model is retrained on another dataset $d_2$, and its generalization is analyzed.

Table 2: Generalization of the sub-network computed from LeNet-5 on MNIST, retrained from the same early network initialization on Fashion-MNIST and CIFAR-10, using 2 different initializations pruned on MNIST. Columns: Dataset, test accuracy of the reference model (Ref.), test accuracy of the masked model (Masked), and Pruning (%).

In Table 2, the source dataset is MNIST, and we generalize it to Fashion-MNIST and CIFAR-10 using Lenet-5 convolutional model.

When we compare the generalization results to pruning with our approach directly on Fashion-MNIST and CIFAR-10 (Table 1), we find that computing the sub-network architecture on MNIST creates a sparser sub-network with better test accuracy. This happens because the solver optimizes on a batch of images that the trained model classifies correctly with high confidence.

4.5 Comparison to SNIP

Our proposed framework can be viewed as a compression technique for over-parameterized neural models. In what follows, we compare it to a state-of-the-art framework, SNIP (Lee et al., 2018). SNIP creates the sparse model before training the neural model by computing the sensitivity of connections, which allows the identification of the important connections. In our methodology, we exclusively identify important neurons and prune all of their connections. To create a fair comparison between both frameworks, we compute neuron importance on the model’s initialization. Remark: we used a pruning threshold and a keep ratio for SNIP, with training procedures as in Section 4.1. We used only 10 images as input to the MIP and 128 images as input to SNIP.
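For readers unfamiliar with SNIP, its connection-sensitivity criterion can be sketched as follows: each connection is scored by the magnitude of the product of its weight and the loss gradient with respect to that weight, normalized over all connections, and only the top-scoring fraction is kept. The snippet is a simplified illustration on a flat list of connections; the function name, the weights, the gradients, and the keep ratio of 0.5 are hypothetical and are not the settings used in this comparison.

```python
def snip_keep_mask(weights, grads, keep_ratio):
    """Return a 0/1 mask keeping the connections with highest sensitivity."""
    # Unnormalized sensitivity of each connection: |w_j * dL/dw_j|.
    saliency = [abs(w * g) for w, g in zip(weights, grads)]
    total = sum(saliency)
    # Normalize so that the sensitivities sum to 1.
    normalized = [s / total for s in saliency]
    n_keep = int(keep_ratio * len(weights))
    ranked = sorted(range(len(weights)), key=lambda j: normalized[j], reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if j in kept else 0 for j in range(len(weights))]


# Hypothetical weights at initialization and gradients from one mini-batch.
weights = [0.5, -1.0, 0.2, 2.0]
grads = [0.4, 0.1, -3.0, 0.01]
mask = snip_keep_mask(weights, grads, keep_ratio=0.5)
```

Note that the mask acts on individual connections, whereas the MIP-based score removes a neuron together with all of its connections.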

Our MIP formulation was only able to prune neurons from the fully connected bottleneck layer of LeNet-5. After creating the sparse networks using both SNIP and the MIP, we trained them on the Fashion-MNIST dataset. The difference in test accuracy between SNIP and our approach was marginal. In terms of the proportion of pruned parameters, SNIP marginally outperformed our approach. This is explained by the fact that we only prune from fully connected layers, and pruning entire neurons only from the bottleneck would result in a weaker model. We end this section by highlighting that we achieved results comparable with SNIP using a much smaller set of images (10 vs. 128).
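The effect of neuron-level pruning on the parameter count can be made concrete with a small calculation. Removing one hidden neuron from a fully connected layer removes its incoming weights, its bias, and its outgoing weights. The layer sizes below are hypothetical and much smaller than LeNet-5; they only illustrate the arithmetic.

```python
def dense_params(sizes, bias=True):
    """Total parameters of a fully connected network with the given layer sizes."""
    return sum(a * b + (b if bias else 0) for a, b in zip(sizes, sizes[1:]))


def params_removed(n_in, n_out, n_pruned, bias=True):
    """Parameters removed when pruning n_pruned neurons from one hidden layer."""
    per_neuron = n_in + n_out + (1 if bias else 0)
    return n_pruned * per_neuron


# Hypothetical network: 8 inputs, 6 hidden neurons, 4 outputs.
sizes = [8, 6, 4]
total = dense_params(sizes)
# Prune 3 of the 6 hidden neurons.
removed = params_removed(n_in=8, n_out=4, n_pruned=3)
fraction = removed / total
```

When only one layer is eligible for pruning, the achievable pruning proportion is bounded by the share of parameters attached to that layer, which is consistent with the explanation given above.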

5 Discussion

We proposed a mixed integer program that incorporates the computation of a neuron importance score in a deep neural network with ReLUs. This is a very first step toward understanding which neurons are critical for the model’s capacity.

Our contribution mainly focuses on providing a scalable way of computing the importance score of each neuron in a fully connected layer, which allows us to create efficient sub-networks that can be trained on different datasets.


This work was partially funded by: IVADO (l’institut de valorisation des données) [G.W., M.C.]; NIH grant R01GM135929 [G.W.]; FRQ-IVADO Research Chair in Data Science for Combinatorial Game Theory, and NSERC grant 2019-04557 [



  • R. Anderson, J. Huchette, C. Tjandraatmadja, and J. P. Vielma (2019) Strong mixed-integer programming formulations for trained neural networks. In International Conference on Integer Programming and Combinatorial Optimization, pp. 27–42. Cited by: §1, §1.1, §1.2, §2.2.
  • L. Barros and B. Weber (2018) CrossTalk proposal: an important astrocyte-to-neuron lactate shuttle couples neuronal activity to glucose utilisation in the brain. The Journal of physiology 596 (3), pp. 347–350. Cited by: §1.1.
  • Y. Bengio, I. Goodfellow, and A. Courville (2017) Deep learning. Vol. 1, Citeseer. Cited by: §1.
  • M. Berglund, T. Raiko, and K. Cho (2015) Measuring the usefulness of hidden units in boltzmann machines with mutual information. Neural Networks 64, pp. 12–18. Cited by: §1.1.
  • Y. Chauvin (1989) A back-propagation algorithm with optimal use of hidden units. In Advances in neural information processing systems, pp. 519–526. Cited by: §1.1.
  • X. Dong, S. Chen, and S. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867. Cited by: §1.
  • M. Ehrgott (2005) Multicriteria optimization. Vol. 491, Springer Science & Business Media. Cited by: §3.
  • M. Fischetti and J. Jo (2018) Deep neural networks and mixed integer linear optimization. Constraints 23 (3), pp. 296–309. Cited by: §1, §1.1, §1.2.
  • J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1.1.
  • M. R. Garey and D. S. Johnson (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co., New York, NY, USA. External Links: ISBN 0716710447 Cited by: §1.2.
  • K. Gimpel and N. A. Smith (2010) Softmax-margin crfs: training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 733–736. Cited by: §3.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §1.
  • R. M. Gray (2000) Toeplitz and circulant matrices: a review, 2002. URL http://ee.stanford.edu/~gray/toeplitz.pdf. Cited by: §1, §2.2.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1.
  • B. Hassibi, D. G. Stork, and G. J. Wolff (1993) Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293–299. Cited by: §1.
  • B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171. Cited by: §1.1.
  • Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: §1.1.
  • S. Hooker, D. Erhan, P. Kindermans, and B. Kim (2019) A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pp. 9734–9745. Cited by: §1.1.
  • A. Jordao, R. Kloss, F. Yamada, and W. R. Schwartz (2018) Pruning deep neural networks using partial least squares. arXiv preprint arXiv:1810.07610. Cited by: §1.1.
  • A. Krizhevsky and G. Hinton (2010) Convolutional deep belief networks on cifar-10. Unpublished manuscript 40 (7), pp. 1–9. Cited by: §1.
  • N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, and F. Kawsar (2015) An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 international workshop on internet of things towards applications, pp. 7–12. Cited by: §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1, §4.1, §4.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2, pp. 18. Cited by: §1.
  • Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §1.1, §1.
  • N. Lee, T. Ajanthan, and P. H. Torr (2018) Snip: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340. Cited by: §1.1, §1, §4.5.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1.1.
  • H. Li, K. Ota, and M. Dong (2018) Learning iot in edge: deep learning for the internet of things with edge computing. IEEE network 32 (1), pp. 96–101. Cited by: §1.
  • K. Liu, R. A. Amjad, and B. C. Geiger (2018) Understanding individual neuron importance using information theory. arXiv preprint arXiv:1804.06679. Cited by: §1.1.
  • P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440. Cited by: §1.1, §1.1.
  • A. S. Morcos, H. Yu, M. Paganini, and Y. Tian (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. arXiv preprint arXiv:1906.02773. Cited by: §1.1.
  • A. Mosek (2010) The mosek optimization software. Online at http://www.mosek.com 54 (2-1), pp. 5. Cited by: §4.1.
  • G. L. Nemhauser and L. A. Wolsey (1988) Integer and combinatorial optimization. Wiley-Interscience, New York, NY, USA. External Links: ISBN 0-471-82819-X Cited by: §1.2.
  • B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro (2018) Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076. Cited by: §1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.1.
  • A. Salama, O. Ostapenko, T. Klein, and M. Nabi (2019) Pruning at a glance: global neural pruning for model compression. arXiv preprint arXiv:1912.00200. Cited by: §1.
  • T. Serra, A. Kumar, and S. Ramalingam (2020) Lossless compression of deep neural networks. arXiv preprint arXiv:2001.00218. Cited by: §1.1, §1.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3145–3153. Cited by: §1.1.
  • S. Srinivas and R. V. Babu (2015) Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149. Cited by: §1.1, §1.
  • V. Tjeng, K. Xiao, and R. Tedrake (2017) Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356. Cited by: §1.1.
  • C. Wang, R. Grosse, S. Fidler, and G. Zhang (2019) Eigendamage: structured pruning in the kronecker-factored eigenbasis. arXiv preprint arXiv:1905.05934. Cited by: §1.
  • A. S. Weigend, D. E. Rumelhart, and B. A. Huberman (1991) Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pp. 875–882. Cited by: §1.1.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §1.
  • R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis (2018) Nisp: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203. Cited by: §1.1, §2, §4.
  • W. Zeng and R. Urtasun (2018) MLPrune: multi-layer pruning for automated neural network compression. In International Conference on Learning Representations (ICLR). Cited by: §1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1.