1 Introduction
Deep learning has proven its ability to solve complex tasks and to achieve stateoftheart results in various domains such as image classification, speech recognition, machine translation, robotics and control (Bengio et al., 2017; LeCun et al., 2015). Overparameterized deep neural models with more parameters than the training samples can be used to achieve stateofthe art results on various tasks (Zhang et al., 2016; Neyshabur et al., 2018). However, the large number of parameters comes at the expense of computational cost in terms of memory footprint, training time and inference time on resourcelimited IOT devices (Lane et al., 2015; Li et al., 2018).
In this context, pruning neurons from an overparameterized neural model has been an active research area. This remains a challenging open problem whose solution has the potential to increase computational efficiency and to uncover potential subnetworks that can be trained effectively. Neural Network pruning techniques (LeCun et al., 1990; Hassibi et al., 1993; Han et al., 2015; Srinivas and Babu, 2015; Dong et al., 2017; Zeng and Urtasun, 2018; Lee et al., 2018; Wang et al., 2019; Salama et al., 2019; Serra et al., 2020) have been introduced to sparsify the model without loss of accuracy.
Most existing work focus on identifying redundant parameters and noncritical neurons to achieve a lossless sparsification of the neural model. The typical sparsification procedure includes training a neural model, then computing parameters importance and pruning existing parameters using certain criteria and finetuning the neural model to regain its lost accuracy. Furthermore, existing pruning and ranking procedures are computationally expensive, require iterations of finetuning on the sparsified model and no experiments were conducted to check the generalization of sparsified models across different datasets.
We remark that sparse neuron connectivity is often used by modern network architectures, and perhaps most notably in convolutional layers used in image processing. Indeed, the limited size of the parameter space in such cases increases the effectiveness of network training and enables learning of meaningful semantic features from the input images Goodfellow et al. (2016). Inspired by the benefits of sparsity in such architecture designs, we aim to leverage the neuron sparsity achieved by our framework to achieve optimized neural architecture that can generalize well across different datasets. For this purpose, we create a sparse subnetwork by optimizing on one dataset and then train the same architecture (i.e., masked) on another dataset indicating a promising direction of future research into the utilization of our approach for effective automatic architecture tuning to augment handcrafted network architecture design.
Contributions.
In our proposed framework, illustrated in Figure 1, we formalize the notation of neuron importance as a score between 0 and 1 for a trained neural network. The neuron importance score reflects how much activity decrease can be inflicted in it, while controlling the loss on the neural network model accuracy. Concretely, we propose a mixed integer programming formulation (MIP) that allows the computation of each neuron score on a fully connected layer and that takes into account the error propagation between the different layers. The representation of trained ReLU neural networks through MIPs has previously been proposed by, e.g., Fischetti and Jo (2018); Anderson et al. (2019) to check neural model robustness. In our framework, we further advance these MIP formulations by introducing new variables quantifying each neuron importance score. The motivation to use such approach comes from the existence of powerful techniques to solve MIPs efficiently in practice, and consequently, the scalability of this procedure to large ReLU neural models. In addition, the proposed formulation can be extended from fully connected layers to convolutional layers when converted to toeplitz format (Gray, 2000).
Our MIP formulation uses a small balanced subset of data points (as small as one per class) to approximate the neurons importance. We perform computational tests to validate the robustness of this approximation by randomly choosing such subset of data points. Once neuron importance scores have been determined, a threshold is established to allow identification of noncritical neurons and to remove them. Even without finetuning, our experiments show only marginal loss in the model accuracy, while removing critical neurons, or even a random selection of neurons (independent of scores), exhibit a significant drop in accuracy.
To add to our contribution, we show that the computed neuron importance score from a specific dataset generalizes to other datasets by retraining on the pruned (sub) network, using the same initialization.
To validate our methodology, we evaluate our approach on simple ReLU neural models trained on MNIST (LeCun et al., 2010), FashionMNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky and Hinton, 2010). The evaluation includes computing the importance score of each neuron in each dataset through a MIP. Afterwards, we create different subnetworks each with different pruning methods. We compare subnetwork performance between pruning noncritical neurons, randomly pruned neurons and critical neurons with the reference neural model. The number of pruned neurons at each layer is the same as the number of pruned neurons in our framework to perform a fair comparison. Another evaluation concerns the existence of subnetworks with the same initialization (lottery tickets) that generalize on different datasets. First, we compute neurons ranking of a neural model on dataset . Then, we take the same neural model initialization and masked neurons computed on dataset to retrain on dataset . These experiments show that the calculated ranking is giving the right ranking for each neuron and that it can generalize across different datasets.
Organization of the paper.
Next, in Section 1.1, we review relevant literature on neural networks sparsification. Section 1.2 provides background on the formulation of ReLU neural networks as MIPs. In Section 2, we introduce the neuron importance score and its incorporation in the mathematical programing model, while Section 3 discusses the objective function that optimizes sparsification. Section 4 provides computational experiments, and Section 5 discusses and summarizes our findings.
1.1 Related Work
Classical weight pruning methods.
LeCun et al. (1990) proposed the optimal brain damage that theoretically prunes weights having a small saliency by computing its second derivatives with respect to the objective. The objective being optimized was the model’s complexity and training error. Hassibi and Stork (1993) introduced the optimal brain surgeon that aims at removing noncritical weights determined by the Hessian computation. Another approach is presented by Chauvin (1989); Weigend et al. (1991)
, where a penalizing term to is added to the loss function during the model training (e.g. L0 or L1 norm) as a regularizer. The model is sparsified during backpropagation of the loss function. Since these classical methods depend
i) on the scale of the weights, ii) are incorporated during the learning process, and, some of them, iii) rely on computing the Hessian with respect to some objective, they turn out to be slow, requiring iterations of pruning and fine tuning to avoid loss of accuracy. Our proposed approach identifies a set of noncritical weights that when pruned together, it results in a marginal loss of accuracy without the need of finetuning or iterations.General weight pruning methods.
Molchanov et al. (2016) proposed a greedy criteriabased pruning with fine tuning by backpropagation. The criteria proposed is given by the absolute difference between dense and sparse neural model loss (ranker). This cost function ensures that the model will not significantly decrease the performance. The drawback of this approach is in requiring a retraining after each layer pruning. Shrikumar et al. (2017) developed a framework that computes the neurons importance at each layer through a single backward pass. This technique compares the activation values among the neurons and assigns a contribution score to each of them based on each input data point. Other related techniques, using different objectives and interpretations of neurons importance, have been presented (Berglund et al., 2015; Barros and Weber, 2018; Liu et al., 2018; Yu et al., 2018; Hooker et al., 2019). They all demand intensive computation, while our approach aims at efficiently determine neurons importance.
Lee et al. (2018) investigates the pruning of connections, instead of entire neurons. The connections sensitivity is studied through the model’s initialization and a batch of input data. Connections sensitivities lower than a certain threshold are removed. Our method generalizes the computed neuron importance among different datasets and it creates a sparsified neural model from a pretrained model without finetuning.
Structured weight filter pruning methods.
Srinivas and Babu (2015) proposed a data free way of pruning neural models by computing correlation between neurons. This correlation is computed using a saliency of two weight sets for all possible sets in the model. The saliency set computation is computationally intensive. Li et al. (2016); Molchanov et al. (2016); Jordao et al. (2018)
focus on methods to prune entire filters in convolutional neural networks which include pruning and retraining to regain lost accuracy.
He et al. (2018) proposed an approach to rank entire filters in convolutional networks. The ranking method uses the L1 norm of the weights for each filter. This approach is effective to detect non critical filters. However, it involves alternating between pruning lowest ranking filters and retraining, in other to avoid a drop in accuracy. Our work focus on nonstructured weight compression and it has the ability of being applied to convolutional filters in toeplitz fully connected format.Lottery ticket.
Frankle and Carbin (2018) introduced the lottery ticket theory that shows the existence of a lucky pruned subnetwork, a winning ticket. This winning ticket can be trained effectively with less parameters, while achieving a marginal loss in accuracy. Morcos et al. (2019) proposed a technique for sparsifying overparameterized trained neural model based on the lottery hypothesis. Their technique involves pruning the model and disabling some of its subnetworks. The pruned model can be finetuned on a different datasets achieving good results. To this end, the dataset used for on the pruning phase needs to be large. The lucky subnetwork is found by to iteratively pruning lowest magnitude weights and retraining.
Representing ANN using MIP.
Fischetti and Jo (2018) and Anderson et al. (2019) formulate a ReLU ANN using a mixed integer programming. Tjeng et al. (2017) uses mixed integer programming to evaluate the robustness of neural models against adversarial attacks. In their proposed technique they assess the trained neural model’s sensitivity to perturbations in input images. Serra et al. (2020) also use mixed integer programming to maximize the compression of an existing neural network without any loss of accuracy. Different ways of compressing (removing neurons, folding layers, etc) are presented. However, the reported computational experiments lead only to the removal of inactive neurons. Our method has the capability to identify such neurons, as well as to additionally identify other units that would not significantly compromise accuracy.
1.2 Background and Preliminaries
Integer programs
are combinatorial optimization problems restricted to discrete decision variables, linear constraints and linear objective function. These problems are NPcomplete, even when variables are restricted to only binary values
(Garey and Johnson, 1979). The difficulty comes from ensuring integer solutions, and thus, the impossibility of using gradient methods. When continuous variables are included, they are designated by mixed integer programs. Advances on combinatorial optimization such as branching techniques, bounds tightening, valid inequalities, decompositions and heuristics, to name few, have resulted on powerful solvers that can in practice solve MIPs of large size in seconds. We refer the reader to
(Nemhauser and Wolsey, 1988) for an introduction on integer programming.Preliminaries.
Considering the th layer of a trained ReLU neural network, denotes the weight matrix and
the bias vector. Let
be the set of neurons for layer , let be a decision vector containing the neurons values of layer for each input data point , i.e., for , and . Let be a binary vector for which an entry takes value if is positive and otherwise. Finally, let and be parameter vectors indicating a valid lower and upper bounds for the value of each neuron in layer . We discuss the computation of these bounds in section 2.3. For now, we assume that and are a sufficiently small and a sufficiently large numbers, respectively.Next, we provide the representation of ReLU neural networks of Fischetti and Jo (2018); Anderson et al. (2019). For sake of simplicity, we describe the formulation for one layer of the model and one input data point ^{1}^{1}1These constraints must be repeated for each layer and each input data point.:
(1a)  
if , otherwise  
(1b)  
(1c)  
(1d)  
(1e)  
(1f) 
In (1a), the initial decision vector is forced to be equal to the input of the first layer. When an entry of is 0, constraints (1b) and (1d) force entry of to be zero, reflecting a non active neuron. If an entry of is 1, then constraints (1c) and (1e) enforce entry of to be equal to . We refer the reader to Fischetti and Jo (2018); Anderson et al. (2019) for details.
2 MIP Constraints
In what follows, we are going to adapt the previously introduced MIP formulation to quantify neuron importance. In particular, we aim to compute it for all layers in the model in an integrated fashion. Assuming we have input logits
, first layer and bounds , , if we solve the MIP at each layer by just putting constraints on the input to this layer and its output, the solver will select some neurons as important assuming the input from the previous layer is unmasked. By adapting the MIP (1) the neuron global importance can be accounted. The global neuron importance computation instead of layer by layer will give better results in terms of predictive accuracy, as shown in Yu et al. (2018).To achieve our goal, we add a new continuous vector that will determine the importance score of each neuron in layer .
2.1 Linear Fully Connected Layers
We start by presenting the constraints used in a MIP formulation for linear fully connected layers. In neural models, ReLU activation functions are mostly used after each layer. These type of activation functions are used to increase the model capacity and to introduce nonlinearity between its layers. However, we start by explaining the introduced neuron importance score variable
in its simplest setup on linear layers. To this end, we consider the following set of constraints:(2a)  
(2b) 
where the output of layer is denoted by the decision vector , and its value is based on and .
A neuron at layer having a noncritical score results in a negative decision variable , by subtracting the input logit upper bound in (2a). The negative decision variable of the noncritical neuron will be disabled and set to zero by the ReLU’s gating decision variable . Therefore, neurons having a score near zero importance are considered noncritical under our model, and, under the optimized model, they can be safely sparsified without losing accuracy. Conversely, neurons having an importance score (or even ) model a case where their activation is critical to the neural network operation. Constraint (2b) ensures our neuron scores are between these 0 and 1, and thus quantify this notion of importance.
2.2 ReLU Fully Connected Layers
In ReLU activated fully connected layers, we use the previously introduced binary vector to act as a gate for the ReLU function. Using along with continuous neurons importance score at layer will allow the MIP to give a score to each neuron. In this way, we obtain the following MIP:
(3a)  
(3b)  
(3c)  
We extend ReLU constraints (1) by adding continuous neuron importance vector to constraints (1e) and (1c).
In (3), setting to zero would zero out the output of the next layer at index by the ReLU. Constraints (3a) and (3b) contain the decision gating variable that is choosing whether to enable the logit output or not along with the neuron score . Subtracting the upper bound weighted by criticality score of neurons in (3a) and (3b) will give the MIP the freedom of reducing and increasing the weights of each neuron in the fully connected setup (or convolutional layers converted to toeplitz matrix Gray (2000)).
Constraint (3c) enforces a neuron score in the range . Relaxing the constraint (1f) of the gating variable to become a continuous variable in will approximate the ReLU activation value with marginal gap^{2}^{2}2The gap gives the distance between the best upper bound and the best lower bound computed for the problem at hand (optimal solutions will have zero gap).(Anderson et al., 2019). Furthermore, the relaxation boosts the solver’s speed from minutes to seconds. Another way of boosting solver’s speed is to use tight bounds on decision variables, reducing the search space.
2.3 Bounds Propagation
In the previous MIP formulation, we assumed a large upper bound and small lower bound . However, using large bounds may lead to long computational times. In order to overcome this issue, we assume an upper and a lower bound at layer for each input point with very small difference between input maximum value and lower value.
(4a)  
(4b)  
(4c) 
Propagating the initial bounds of the input images throughout the trained model will create upper and lower bounds for input point in each layer.
In our formulation, we use tight lower and upper bounds to keep under control the approximation of the trained neural model when is relaxed. With tighter bounds, now the MIP has a narrow optimization space of possible feasible solutions satisfying the constraints (3).
3 MIP Objectives
The objective we set for the proposed framework is to sparsify noncritical neurons without reducing the predictive accuracy of the neural model. To this end, we use two optimization objectives.
Our first optimization objective is to maximize the number of neurons sparsified from the trained neural model. The trained neural model has layers, and we are following the same notation explained in 1.2.
Let be the sum of neuron importance scores at layer having neurons scaled down to range . Consider the case where we had neurons having score in layer , and in another arbitrary layer we have a set of neurons having same score . In that case, both layers will have the same sum of scores, which would confuse the solver when narrowing down to the optimal solution. Giving lower value to noncritical neurons and minimizing helps the solver tend to sparsify more neurons while satisfying the used constraints.
In order to create a relation between neuron scores in different layers, our objective becomes the maximization of the amount of neurons sparsified from layers having higher score . Hence, we denote and formulate the sparsity loss as
(5) 
Here, the objective is to maximize the number of noncritical neurons at each layer compared to other layers in the trained neural model. The sparsity quantification is then normalized by the total number of neurons.
Our second optimization objective is to minimize the loss of (important) information due to the sparsification of the trained neural model. Further, we aim for this minimization to be done without relying on the values of the logits, which are closely correlated with neurons pruned at each layer which would drive the MIP to simply give a full score of to all neurons in order to keep same output logit value. Instead, we formulate this optimization objective using the marginal softmax proposed by Gimpel and Smith (2010). Using marginal softmax allows the solver to focus on minimizing the misclassification error without relying on logits values. Softmax marginal loss avoids putting large weight on logits coming from the trained neural model and predicted logits from decision vector computed by the MIP. On the other hand, it tries to optimize for the label having the highest logit value. Formally, we write the objective
(6) 
where index stands for the class label. The function (3) minimizes the decision variables in that model the output of layer , which is computed from the MIP, while keeping its value at the correct label . The used marginal softmax objective keeps the correct predictions of the trained model for the input batch of images
having one hot encoded labels
without considering the logit value.Finally, we combine the two objectives to formulate the multiobjective loss
(7) 
as weighted sum of sparsification regularizer and marginal softmax, as proposed by Ehrgott (2005).
Note that our objective function (7) is implicitly using a Lagrangian relaxation, where is the Lagrange multiplier. In fact, one would like to control the loss on accuracy (3) by imposing a constraint for a very small , or even to avoid any loss via . However, this would introduce a nonlinear constraint which would be hard to handle. Thus, for tractability purposes we follow a Lagrangian relaxation on this constraint, and penalize the objective whenever is positive. Accordingly with the weak (Lagrangian) duality theorem, the objective (7) is always a lower bound to the problem where we minimize sparsity and we impose a bound on the accuracy loss. Furthermore, the problem of finding the best for the Lagrangian relaxation, formulated as
(8) 
has the wellknown property of being concave, which in our experiments revealed to be easily determined^{3}^{3}3We remark that if the trained model has misclassifications, there is no guarantee that problem (8) is concave.. We note that the value of introduces a trade off between pruning more neurons and predictive capacity of the model. For example, increasing the value of would prune fewer neurons as shown in Figure 2, but the accuracy on the test set will increase.
4 Experiments
We validate the proposed approach on several simple architectures proposed by LeCun et al. (1998). We first create different subnetworks where the same number of neurons is removed in each layer to allow a fair comparison among them. These subnetworks are obtained through different procedures based on the computed neuron importance score: noncritical, critical and randomly pruned neurons. We evaluate the test accuracy of each introduced subnetwork to show how meaningful is the computed neuron importance score. We then compute a pruned subnetwork on MNIST dataset and we retrain it on both Fashion MNIST and CIFAR10 to show generalization of lucky subnetworks across different datasets (lottery hypothesis). Finally, we compare to Yu et al. (2018), which is used to compute connections sensitivity and to create a sparsified subnetwork based on the input dataset and model initialization.
4.1 Experimental Setting
All models were trained for 20 epochs using Rmsprop optimizer with 1e3 learning rate for MNIST and Fashion MNIST. Lenet 5
(LeCun et al., 1998) on CIFAR10 was trained using SGD optimizer with learning rate 1e2 and 256 epochs. The hyper parameters were tuned on the validation set’s accuracy.All images were resized to 32 by 32 and converted to 3 channels to allow pruned model generalization across different datasets. The MIP is fed with a balanced set of images, each representing a class of the classification task. The aim is to avoid that the determined importance score leads to pruning neurons (features) critical to a class for which a small number of images exists in the input set.
The proposed framework, recall Figure 1, computes the importance score of each neuron, and with a small threshold, we start masking noncritical neurons with a score lower than it. The masked neural model is then directly used to infer labels of the test set without significant loss of accuracy. In order to regain the lost accuracy, we applied finetuning using a single epoch on the masked model. We used in the MIP objective function (7).
We used a simple fully connected (FC3) model 3layer NN, 300+100 hidden units by LeCun et al. (1998), and another simple fully connected model (FC4) 4layer NN, 200+100+100 hidden units. In addition, we used convolutional LeNet 5 (LeCun et al., 1998). Each of these models was trained 5 times with different initialization.
4.2 MIP Input Verification
In this experiment, we assess the robustness of our MIP formulation against different batches of input images to the MIP. Namely, we used 25 randomly sampled balanced images from the validation set. The randomly selected images can have misclassified images by the neural model.
As Figure 3 illustrates, changing the input images used by the MIP to compute neuron importance score resulted in marginal changes in the test accuracy between different batches. We remark that the input batches may contain images that were misclassified by the trained neural model. In this case, the MIP tries to use the score to achieve the true label. The marginal fluctuation of pruning percentage and test accuracy shown in Figure 3 between different batches depends on the accuracy of the input batch optimized by the MIP.
4.3 Comparison to Random and Critical Pruning
Dataset  MNIST  FashionMNIST  CIFAR10  
Model  FC3  FC4  Lenet  FC3  FC4  Lenet  
Method  
Ref.  
RP.  
CP.  
Ours  
Ours + ft  
Prune (%)  
MIP Time (s)  ^{4}^{4}4Relaxed the ReLU constraints for faster solving time with marginal gap 
We start by training a reference model (Ref.) using the training parameters in Section 4.1. After training and evaluating the reference model on the test set, we fed an input batch of images from the validation set to the MIP. The MIP solver computes the neuron importance score based on the input images fed to it. In our experimental setup, we used images, each representing a class. We pruned any neuron having a score less than in all FCmodels and
for Lenet5. In convolutional models, neurons of the fully connected bottleneck of a classifier are critical to have a lossless compression. The MIP solver tends to give more neurons a score above
. Thus, to achieve a meaningful pruning percentage, we used a higher threshold.What happens if we mask critical neurons instead of noncritical neurons ?
Would the model’s accuracy on test set drop after masking critical neurons ? We test our approach by pruning neurons having top score from each layer. In that case, our experiments of Table 1 show a significant drop in the test accuracy when compared with the reference model. Having a significant drop in the test accuracy shows that the neuron score is keeping the critical neurons in order to get a close to lossless compression.
What will happen if we finetune our pruned model for just 1 epoch ?
We used this experiment to show that in some cases after finetuning for just 1 epoch the model’s accuracy can surpass the reference model. Since the MIP is solving its marginal softmax (3) on true labels, the generated subnetwork, after finetuning, can surpass the reference model.
4.4 Generalization Between Different Datasets
In this experiment, we show that our proposed approach of computing neuron importance generalizes across different datasets using the same model’s initialization. In this experiment, we train the model on a dataset and we create a masked neural model using our approach. After creating the masked model, we restart the model to its original initialization. Finally, the new masked model is retrained on another dataset , and its generalization is analyzed.
Dataset  Ref.  Masked  Pruning (%) 

Test Acc.  
Fashion MNIST  
CIFAR10  
In Table 2, the source dataset is MNIST, and we generalize it to FashionMNIST and CIFAR10 using Lenet5 convolutional model.
When we compare generalization results to pruning using our approach on FashionMNIST and CIFAR10, Table 1, we discover that computing subnetwork architecture on MNIST whose model is achieving a test acc. is creating a more sparse subnetwork with better test accuracy. This behavior is happening because the solver is optimizing on a batch of images that are classified correctly with high confidence from the trained model.
4.5 Comparison to SNIP
Our proposed framework can be viewed as a compression technique of overparameterized neural models. In what follows, we compare it to the stateoftheart framework: SNIP (Lee et al., 2018). SNIP creates the sparse model before training the neural model by computing the sensitivity of connections. This allows the identification of the important connections. In our methodology, we exclusively identify important neurons and prune all its connections. In order to create a fair comparison between both frameworks, we compute neuron importance on the model’s initialization^{5}^{5}5Remark: we used and pruning threshold and keep ratio for SNIP, training procedures as in Section 4.1. We used only 10 images as input to the MIP and 128 images as input to SNIP.
Our MIP formulation was only able to prunes neurons from fully connected bottleneck layer of Lenet5. After creating the sparse network using both SNIP and MIP, we then trained them on FashionMNIST dataset. The difference between SNIP () and our approach () was marginal in terms of test accuracy. SNIP pruned of the model’s parameters and our approach . SNIP marginally outperformed our approach in terms of pruning proportion. This is explained by the fact that we are only pruning from fully connected layers and pruning entire neurons only from the bottleneck would result on a weaker model. We end this section by highlighting, that we achieved results comparable with SNIP using a much smaller set of images (10 Vs 128).
5 Discussion
We proposed using a mixed integer program that incorporates the computation of neurons importance score in a deep neural network with ReLUs. This is a very first step in the direction of understanding which neurons are critical for the model’s capacity.
Our contribution mainly focus on providing a scalable way of computing the importance score of each neurons in a fully connected layer. Hence, allowing us to create efficient subnetworks able to be trained on different datasets.
Acknowledgements
This work was partially funded by: IVADO (l’institut de valorisation des données) [G.W., M.C.]; NIH grant R01GM135929 [G.W.
]; FRQIVADO Research Chair in Data Science for Combinatorial Game Theory, and NSERC grant 201904557 [
M.C.].References
 Strong mixedinteger programming formulations for trained neural networks. In International Conference on Integer Programming and Combinatorial Optimization, pp. 27–42. Cited by: §1, §1.1, §1.2, §2.2.
 CrossTalk proposal: an important astrocytetoneuron lactate shuttle couples neuronal activity to glucose utilisation in the brain. The Journal of physiology 596 (3), pp. 347–350. Cited by: §1.1.
 Deep learning. Vol. 1, Citeseer. Cited by: §1.

Measuring the usefulness of hidden units in boltzmann machines with mutual information
. Neural Networks 64, pp. 12–18. Cited by: §1.1.  A backpropagation algorithm with optimal use of hidden units. In Advances in neural information processing systems, pp. 519–526. Cited by: §1.1.
 Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867.
 Multicriteria optimization. Vol. 491, Springer Science & Business Media.
 Deep neural networks and mixed integer linear optimization. Constraints 23 (3), pp. 296–309.
 The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
 Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA. ISBN 0716710447.
 Softmax-margin CRFs: training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 733–736.
 Deep Learning. MIT Press.
 Toeplitz and circulant matrices: a review, 2002. URL http://ee.stanford.edu/~gray/toeplitz.pdf.
 Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
 Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299.
 Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171.
 Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866.
 A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pp. 9734–9745.
 Pruning deep neural networks using partial least squares. arXiv preprint arXiv:1810.07610.
 Convolutional deep belief networks on CIFAR-10. Unpublished manuscript 40 (7), pp. 1–9.
 An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, pp. 7–12.
 Deep learning. Nature 521 (7553), pp. 436–444.
 Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
 MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2, pp. 18.
 Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
 SNIP: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340.
 Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
 Learning IoT in edge: deep learning for the internet of things with edge computing. IEEE Network 32 (1), pp. 96–101.
 Understanding individual neuron importance using information theory. arXiv preprint arXiv:1804.06679.
 Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
 One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. arXiv preprint arXiv:1906.02773.
 The MOSEK optimization software. Online at http://www.mosek.com 54 (21), pp. 5.
 Integer and Combinatorial Optimization. Wiley-Interscience, New York, NY, USA. ISBN 047182819X.
 Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.
 PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
 Pruning at a glance: global neural pruning for model compression. arXiv preprint arXiv:1912.00200.
 Lossless compression of deep neural networks. arXiv preprint arXiv:2001.00218.
 Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3145–3153.
 Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149.
 Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356.
 EigenDamage: structured pruning in the Kronecker-factored eigenbasis. arXiv preprint arXiv:1905.05934.
 Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, pp. 875–882.
 Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
 NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
 MLPrune: multi-layer pruning for automated neural network compression. In International Conference on Learning Representations (ICLR).
 Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.