1 Introduction
Deep neural networks typically have large memory and compute requirements, making them difficult to deploy on small devices such as mobile phones and tablets [KimICLR2016]. These cumbersome networks are pruned by removing their redundant weights, layers, filters, and blocks [Tan2019EfficientNetRM, he2018amc, lemaire2019structured, YanMIPR2020, davis2020state, UmairICIP2021]. Neural network pruning strategies can be grouped into three categories: i) offline pruning, ii) online pruning, and iii) pruning via reinforcement learning.
Offline Pruning requires multiple iterations of training, pruning, and fine-tuning. Magnitude-based pruning [Han2015nips] works on the ‘magnitude equals salience’ principle. [lecun1990obd, LebedevCVPR2016] use the second derivative of the weights to rank connections, while more recent approaches suggest lookahead pruning strategies [lee2021lookahead, YanMIPR2020], in which the magnitudes of previously connected layers are also considered. Layers are pruned independently to speed up the pruning process in [dong2017layerobd]. Instead of reducing the original, deep network (called the “teacher”) by pruning connections, knowledge distillation [hinton2015distilling] trains another, smaller network (called the “student”) to mimic the output of the teacher network. However, it requires designing an ad hoc student network, which is a tedious task.
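As a minimal illustration of the ‘magnitude equals salience’ principle, the sketch below (our own illustrative helper, not code from any cited method) zeroes out the smallest-magnitude weights until a target sparsity is reached:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(sparsity * len(weights))     # number of weights to remove
    if k == 0:
        return list(weights)
    # k-th smallest magnitude becomes the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# The two smallest-magnitude weights (0.05 and -0.01) are removed.
pruned = magnitude_prune([0.05, -0.9, 0.4, -0.01], 0.5)
```

Note that ties at the threshold would remove slightly more than the requested fraction; a production implementation would break ties explicitly.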
Online Pruning: A more recent class of techniques poses the problem of pruning as a learning problem by introducing a mask vector that acts as a gate or an indicator function to turn on/off a particular component (connection, filter, layer, block) [li2017filter, HuangECCV2018, ZhanhongPCNNDAC2020, UmairICIP2021]. The mask vector can be treated as a trainable parameter and optimized directly through gradient descent. The mask can also be obtained via an auxiliary neural network [lin2017runtime, he2018amc]. Pruning via Surrogate Lagrangian Relaxation (PSLR) [GurevinArxiv2020] utilizes a surrogate gradient algorithm for Lagrangian relaxation [ZhaoSLR1997]. [lemaire2019structured, Enderich_2021_WACV, Tan2019EfficientNetRM] use budget-aware pruning, which tries to reduce the size of a network to respect a specific budget on computational and space complexities; these budgets introduce an arbitrary function that is non-differentiable. One shortcoming of current budget-aware techniques is that they require the function on which the budget is imposed to be either differentiable or fully specified [lemaire2019structured]. This, however, is not always possible when we, for example, want to impose budgets on metrics such as inference time. The main question, then, is the following: can we prune neural networks to respect budgets on arbitrary, possibly non-differentiable, functions? One way to solve this problem is to leverage recent techniques in reinforcement learning (RL) [sutton1998rl, reinforcementlearning].

Pruning via Reinforcement Learning: In reinforcement learning [reinforcementlearning], an agent interacts with its environment by taking different actions. Each action results in the agent receiving a reward depending upon how good or bad the action is. If the agent is trained to predict sparsity as its action, then the accuracy of the obtained model can be used as a reward, resulting in RL-based neural network pruning [he2018amc]. Reinforcement learning [sutton1998rl, reinforcementlearning] can optimize arbitrary non-differentiable functions while respecting budgets on computational and space complexities.
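Returning to the online-pruning idea of a trainable gate, here is a minimal sketch of the soft-gate mechanism (our own toy construction, not code from any cited work): each component's output is multiplied by a sigmoid of a trainable logit, and an L1 penalty on the gates lets gradient descent push them toward 0 (pruned) or 1 (kept).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, -1.0, 0.5]        # one trainable gate logit per filter
activations = [1.0, 1.0, 1.0]    # stand-in for per-filter outputs
l1_weight = 0.1                  # sparsity-inducing penalty strength

gates = [sigmoid(z) for z in logits]
outputs = [g * a for g, a in zip(gates, activations)]

# Gradient of the penalty  l1_weight * sum(gates)  w.r.t. each logit,
# using d sigmoid(z)/dz = sigmoid(z) * (1 - sigmoid(z)).
grads = [l1_weight * g * (1.0 - g) for g in gates]
logits = [z - 0.1 * dz for z, dz in zip(logits, grads)]  # one descent step
```

In practice the penalty gradient would be combined with the task-loss gradient; this fragment shows only the gate-shrinking term.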
One such example is AutoML for Model Compression (AMC) [he2018amc], in which an agent is trained to predict the sparsity ratio for each layer. Accuracy is returned as a reward to encourage the agent to build smaller, faster, and more accurate networks. Similarly, Conditional Automated Channel Pruning (CACP) [CACP2021IEEESPL] is another RL-based method that simultaneously produces compressed models under different compression rates through a single layer-by-layer channel pruning process. While these methods take both sparsity and accuracy into account, they lack fine-grained control over these constraints.
Our main contributions: In this work, we formulate the problem of pruning in the constrained reinforcement learning (CRL) framework. We propose a general methodology that prunes neural networks to respect specified budgets on arbitrary, possibly non-differentiable, functions. Furthermore, we only assume the ability to evaluate these functions for different inputs; hence they do not need to be fully specified beforehand. Our proposed CRL framework outperforms state-of-the-art methods in compressing popular models such as VGG [simonyan2015vgg] and ResNets [he2015resnet] on benchmark datasets.
2 Pruning via Constrained Reinforcement Learning
Constrained reinforcement learning [altmanconstrainedMDP] is an extension of the reinforcement learning problem in which agents, in addition to the reward, also receive a cost. The agent’s goal is to maximize its cumulative reward subject to its cumulative cost being less than some predefined threshold. One interesting property here is that neither the reward nor the cost function needs to be differentiable or fully specified. The agent simply needs to be fed a scalar reward and a scalar cost value each time it performs an action.
Let $\theta$ denote the parameters of a neural network. Each element of $\theta$ represents the weight of a single connection. Removing a connection is thus equivalent to multiplying its weight by $0$. Let $\mathcal{L}$ denote the loss function of the neural network. Pruning, in its most general form, tries to find a mask $m$ that solves the following optimization:

$$\min_{m} \; \mathcal{L}(m \odot \theta) \quad \text{subject to} \quad C(m \odot \theta) \le \alpha, \tag{1}$$

where $\odot$ is the element-wise Hadamard product, $C$ is some arbitrary, possibly non-differentiable function, and $\alpha$ is a known constant. Here $C$ represents the computational and space complexity of the network. For example, if we want to compress the network (in terms of the space it occupies) by at least a given fraction, we can let $C$ be the $\ell_0$ norm and set $\alpha$ accordingly. Similarly, if we wanted to optimize for speed, we could set $C$ to be the number of FLOPs the network consumes (or the time it takes) when it is pruned according to $m$.
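To make the constraint concrete, here is a hedged sketch of one possible budget function: the fraction of connections a mask keeps. The point is that we only need to be able to evaluate such a function; it need not be differentiable (these helper names are our own, purely illustrative):

```python
def space_cost(mask):
    """One possible C: fraction of connections kept -- evaluable, not differentiable."""
    return sum(1 for m in mask if m != 0) / len(mask)

def satisfies_budget(mask, alpha):
    """Check the constraint C(mask) <= alpha from Eq. (1)."""
    return space_cost(mask) <= alpha

mask = [1, 0, 0, 1, 0, 0, 0, 1]
kept = space_cost(mask)   # fraction of the 8 connections that survive
```

A FLOP count or a wall-clock timing of the pruned network could be dropped in as `space_cost` without changing anything else, which is exactly the flexibility the formulation is after.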
In our proposed approach, the mask is modeled via the actions performed by an agent interacting with the environment. In the beginning, the environment loads a pretrained network (e.g., VGG16 [simonyan2015vgg]). The agent is then fed the layer representation of the first convolutional layer in the network (defined in Section 2.1). The agent then specifies an appropriate action. This action is recorded, and the agent is fed the next state. Once the agent has finished predicting its actions for all convolutional layers, the network is pruned according to the agent’s specified actions and fine-tuned on the training set (the same training set that was used to pretrain the network). The accuracy of the fine-tuned network is returned as the agent’s reward along with the cost (the reward and cost at all other timesteps are $0$). Figure 1 shows an illustration of our pruning process.
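The interaction loop just described can be sketched as follows (a hypothetical environment API of our own design, for illustration only; `prune_and_finetune` stands in for the expensive prune-then-fine-tune step):

```python
def run_episode(layer_states, policy, prune_and_finetune):
    """One episode: predict a sparsity ratio per conv layer, then prune once.

    layer_states       -- list of per-layer feature tuples (the states)
    policy             -- maps a state to an action (a sparsity ratio)
    prune_and_finetune -- black box returning (accuracy, cost) for all actions
    """
    actions = []
    for state in layer_states:           # one timestep per conv layer
        actions.append(policy(state))
    accuracy, cost = prune_and_finetune(actions)
    # Reward and cost are zero at all intermediate timesteps; only the final
    # timestep returns the fine-tuned accuracy (reward) and the cost.
    rewards = [0.0] * (len(layer_states) - 1) + [accuracy]
    costs = [0.0] * (len(layer_states) - 1) + [cost]
    return actions, rewards, costs

# Toy usage with a stub policy and environment.
actions, rewards, costs = run_episode(
    [(0,), (1,), (2,)],
    policy=lambda s: 0.5,
    prune_and_finetune=lambda a: (0.9, 0.3),
)
```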
2.1 Constrained Markov Decision Process
Constrained Markov Decision Processes: In order to formulate our problem as a constrained RL problem, we first need to define our constrained MDP (CMDP). Let $f_\theta$ denote a neural network with $L$ convolutional layers, followed by a few fully connected layers. We wish to prune $f_\theta$. Furthermore, let $\theta_l$ denote the parameters corresponding to layer $l$. We define the key components of the CMDP as follows:
State, $s$: Each convolutional layer corresponds to a single state. Similar to the scheme in [he2018amc], each of these layers is represented by the following tuple:

(layer index $l$, input channels, number of filters, kernel size, stride, padding)

where $l$ is the index of that layer and the remaining entries are the attributes of a convolutional layer.

Action, $a$: For the action representation, we end up with a mask vector $a_l$ at each layer $l$. The length of this mask vector is equal to the number of filters in layer $l$. We use $a$ to collectively denote the mask vectors for all layers.

Transition function: The agent always transitions from state $s_l$ to state $s_{l+1}$, so the transitions are fixed and independent of the agent’s actions.

Reward function, $r$: Let $B$ be a batch of input examples (uniformly) sampled from the training dataset. We define the reward as follows:

$$r = -\mathcal{L}(m \odot \theta; B), \tag{2}$$

where $\mathcal{L}(m \odot \theta; B)$ is the loss function of the network evaluated on the batch $B$.

Cost function, $c$: The cost function is defined as

$$c = C(m \odot \theta; B), \tag{3}$$

where $C$ is our constraint function evaluated on the batch $B$.

Budget, $\alpha$: This is the budget on $C$ that we wish our pruned network to respect.
The policy predicts a sparsity ratio for each layer. Filters are then pruned via magnitude-based pruning [Han2015nips] up to the desired sparsity.
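The per-layer pruning step — removing filters up to the predicted sparsity by magnitude ranking — might look like the following sketch. Taking a filter’s “magnitude” to be the L1 norm of its weights is our own assumption here, purely for illustration:

```python
def prune_filters(filter_weights, sparsity):
    """Keep the filters with the largest L1 norms; zero out the rest.

    filter_weights -- list of per-filter weight lists
    sparsity       -- fraction of filters to remove (the policy's action)
    """
    norms = [sum(abs(w) for w in f) for f in filter_weights]
    n_prune = int(sparsity * len(filter_weights))
    # Indices of the n_prune filters with the smallest L1 norm.
    order = sorted(range(len(norms)), key=lambda i: norms[i])
    pruned_idx = set(order[:n_prune])
    return [[0.0] * len(f) if i in pruned_idx else list(f)
            for i, f in enumerate(filter_weights)]

filters = [[0.1, -0.1], [1.0, 2.0], [0.3, 0.2]]
kept = prune_filters(filters, 1 / 3)   # removes the weakest of the 3 filters
```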
2.2 Algorithm
Let $d_a$ be the dimension of the action and $d_s$ the dimension of the state vector (recall that each state vector corresponds to a single layer). We model our policy as a (diagonal) multivariate Gaussian distribution $\pi_\theta(\cdot \mid s) = \mathcal{N}(\mu_\theta(s), \sigma^2 I)$. Here $\mu_\theta$ is a neural network with parameters $\theta$, $\sigma$ is a trainable vector, and $I$ is the identity matrix (hence $\sigma^2 I$ is the covariance matrix of the distribution). The network takes as input a state vector provided by the environment and outputs a vector of dimension $d_a$. We then simply sample an action from $\pi_\theta(\cdot \mid s)$ and feed it to the environment. In parallel, we also train two other neural networks: a reward value function $V_{\phi_r}$ and a cost value function $V_{\phi_c}$, with parameters $\phi_r$ and $\phi_c$, respectively; the cost value function is defined analogously to the reward value function. Recall that having these networks helps us reduce variance.
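Sampling from such a diagonal Gaussian policy can be sketched in a few lines (plain Python in place of an actual neural network; `mu` below stands in for the network’s output, and is purely illustrative):

```python
import math
import random

def sample_action(mu, sigma, seed=None):
    """Sample a ~ N(mu, diag(sigma^2)); each dimension is independent."""
    rng = random.Random(seed)
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

def log_prob(action, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - (a - m) ** 2 / (2.0 * s * s)
        for a, m, s in zip(action, mu, sigma)
    )

mu = [0.5, 0.7]      # stand-in for the network's output (mean sparsity ratios)
sigma = [0.1, 0.1]   # the trainable standard-deviation vector
action = sample_action(mu, sigma, seed=0)
```

The log-density is what the importance ratio in the PPO objective is built from.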
We initialize our policy and value networks randomly and collect data from the environment. Each data point is a tuple $(s_t, a_t, r_t, c_t, s_{t+1})$, where $s_t$ and $s_{t+1}$ are the states at times $t$ and $t+1$, respectively, $a_t$ is the action taken at time $t$, and $r_t$ and $c_t$ are the reward and cost received as a consequence of taking action $a_t$. We use this dataset to update our parameters. Furthermore, we initialize our Lagrange multiplier $\lambda$ with a constant value and also update it using this dataset. This entire process is repeated until convergence.
Recall that at each iteration, we are interested in optimizing (here we decompose the expectation over trajectories into an expectation over states and actions):

$$J^{PPO}_{LAG}(\lambda, \pi_\theta) = \sum_{t=1}^{T} \mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[ \frac{\pi_\theta(s_t, a_t)}{\pi_{\bar\theta}(s_t, a_t)}\, J^r(s_t, a_t, s_{t+1}) - \lambda \left( J^c(s_t, a_t, s_{t+1}) - \alpha \right) \right],$$

where $\alpha$ is the budget and the losses are:
$$L^r(\theta) = \mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[ \frac{\pi_\theta(s_t, a_t)}{\pi_{\bar\theta}(s_t, a_t)}\, J^r(s_t, a_t, s_{t+1}) \right], \tag{4}$$

$$L^c(\theta) = \mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[ \frac{\pi_\theta(s_t, a_t)}{\pi_{\bar\theta}(s_t, a_t)}\, J^c(s_t, a_t, s_{t+1}) \right]. \tag{5}$$
All parameters are updated via gradient descent using their respective learning rates $\eta$. Specifically, the policy network is updated as:
$$\theta \leftarrow \theta + \eta_\theta \nabla_\theta \big( L^r(\theta) - \lambda\, L^c(\theta) \big), \tag{6}$$
and the Lagrange multiplier as:
$$\lambda \leftarrow \max\big(0,\; \lambda + \eta_\lambda \big( L^c(\theta) - \alpha \big)\big). \tag{7}$$
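The multiplier update of Eq. (7) amounts to a projected dual ascent step, which can be sketched as follows (a hedged reading on our part, with illustrative numbers):

```python
def update_lambda(lmbda, avg_cost, alpha, lr):
    """Projected dual step: raise lambda when the cost exceeds the budget
    alpha, lower it otherwise, and clip at zero (lambda must stay >= 0)."""
    return max(0.0, lmbda + lr * (avg_cost - alpha))

lam = update_lambda(1.0, avg_cost=0.8, alpha=0.5, lr=0.1)    # cost over budget
lam2 = update_lambda(0.01, avg_cost=0.1, alpha=0.5, lr=0.1)  # under budget
```

When the constraint is violated, $\lambda$ grows and the cost term dominates the policy update; once the agent is comfortably within budget, $\lambda$ decays toward zero and the reward term takes over.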
Furthermore, we define the loss for the reward value function network as:
$$L(\phi_r) = \mathbb{E}_{s_t \sim \mathcal{D}}\big[ \big( V_{\phi_r}(s_t) - \hat{R}_t \big)^2 \big], \tag{8}$$

where $\hat{R}_t$ is the observed return from state $s_t$,
and update its parameters as:
$$\phi_r \leftarrow \phi_r - \eta_{\phi_r} \nabla_{\phi_r} L(\phi_r). \tag{9}$$
Similarly, the loss for the cost value function network is defined as:
$$L(\phi_c) = \mathbb{E}_{s_t \sim \mathcal{D}}\big[ \big( V_{\phi_c}(s_t) - \hat{C}_t \big)^2 \big], \tag{10}$$

where $\hat{C}_t$ is the observed cumulative cost from state $s_t$,
and its parameters are updated as:
$$\phi_c \leftarrow \phi_c - \eta_{\phi_c} \nabla_{\phi_c} L(\phi_c). \tag{11}$$
3 Results
3.1 Experimental Setup
We evaluated our approach on the CIFAR10 dataset [krizhevsky2009learning] using ResNet18 [he2015resnet] and variants of the VGG [simonyan2015vgg] network. Training was performed using the Adam optimizer [kingma2014method]. Experiments were conducted on VGG11, VGG16, VGG19, and ResNet18. For each model, we ran the experiments using different budget values. Initially, the policy network was trained for a fixed number of iterations. The policy network then pruned the original model by predicting the sparsity ratio for every convolutional layer using the PPO-Lagrangian algorithm. The Lagrange multiplier $\lambda$ was initialized with a fixed value. Once the network was pruned, it was fine-tuned for a certain number of iterations. Since fine-tuning is computationally expensive, we adopted a schedule with hyperparameter values of 0, 32, and 128, respectively; that is, we fine-tune less in the beginning and more towards the end. In practice, we normalize our reward and cost values with a running mean and standard deviation that is continually updated as more data is collected. Furthermore, we also normalize the state vector in a similar way to improve stability.
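The running-statistics normalization mentioned above can be sketched like this (a simple incremental mean/variance tracker of our own; not necessarily the exact scheme used in the experiments):

```python
class RunningNorm:
    """Normalize values with a continually updated mean and std deviation."""

    def __init__(self, eps=1e-8):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        # Welford's online algorithm for the running mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 + self.eps)

norm = RunningNorm()
for r in [1.0, 2.0, 3.0, 4.0]:   # e.g. rewards observed so far
    norm.update(r)
z = norm.normalize(2.5)          # 2.5 equals the running mean -> ~0
```

The same object can be reused for costs and for each state-vector component.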
3.2 Ablative Study
To demonstrate the efficacy of the proposed constrained RL method, we performed experiments on VGG11, VGG16, and VGG19 [simonyan2015vgg] and compared the results with the magnitude-based pruning method. We train each of these networks on the CIFAR10 dataset [krizhevsky2009learning]. Pretrained VGG models were used to train the policy network. During pruning, the models were fine-tuned according to the fine-tuning schedule. For each model, we experimented with two budget values; increasing the budget value decreases the sparsity of the pruned network. The ablative study showed that the proposed constrained RL method is significantly more effective than magnitude-based pruning (see Table 1). Our method achieved higher accuracy than magnitude-based pruning in all experiments. In fact, it even outperformed the unpruned network in 4 out of 6 cases.
Method                      Sparsity (%)   Acc. (%)   Sparsity (%)   Acc. (%)
VGG11
  Unpruned                  --             89.23      --             89.23
  Magnitude-based Pruning   80.00          85.50      90.00          83.80
  Proposed CRL              83.48          89.11      90.75          88.09
VGG16
  Unpruned                  --             90.69      --             90.69
  Magnitude-based Pruning   80.00          88.40      90.00          87.10
  Proposed CRL              83.81          90.96      92.90          89.89
VGG19
  Unpruned                  --             90.59      --             90.59
  Magnitude-based Pruning   80.00          88.40      90.00          86.90
  Proposed CRL              83.48          91.06      92.31          91.31
3.3 Comparison with Stateoftheart
To prove the effectiveness of the proposed constrained RL approach, we also conducted experiments with ResNet18, VGG16, and ResNet50 to compare our results with state-of-the-art methods on the CIFAR10 dataset.
In the case of VGG16, we compared our method with two state-of-the-art methods, namely Conditional Automated Channel Pruning (CACP) [CACP2021IEEESPL] and GhostNet [HanGhostNet2020CVPR]. Our pruned VGG16 model performed better than both state-of-the-art methods. Note that at the first budget setting, our pruned model contained 2.38M parameters, fewer than the pruned models of both CACP and GhostNet, which contained 4.41M and 3.30M parameters, respectively. Despite this, our pruned VGG16 still produced an accuracy change of +0.27, which is better than both state-of-the-art methods and our unpruned VGG16 baseline. Moreover, at the tighter budget setting, our pruned model contained only 1.05M parameters, yet its accuracy change was similar to that of GhostNet, whose pruned model has 3.30M parameters.
Table 2 also shows the comparison of the proposed method with four state-of-the-art methods on the ResNet18 architecture and the CIFAR10 dataset. The state-of-the-art methods include Prune it Yourself (PIY) [YanMIPR2020], Conditional Automated Channel Pruning (CACP) [CACP2021IEEESPL], Pruning via Surrogate Lagrangian Relaxation (PSLR) [GurevinArxiv2020], and PCNN [ZhanhongPCNNDAC2020]. It can be seen that the proposed method outperformed all these methods in terms of accuracy and three out of four state-of-the-art methods in terms of compression. Table 2 also shows the comparison of the proposed method with the state-of-the-art AMC [he2018amc] method on ResNet50, which our method outperforms.
                                     Unpruned               Pruned
Method                           Acc.   Params (M)    Acc.   Params (M)   ΔAcc.
VGG16
  CACP [CACP2021IEEESPL]         93.02  14.73         92.89  4.41         -0.13
  GhostNet [HanGhostNet2020CVPR] 93.60  14.73         92.90  3.30         -0.70
  CRL                            90.69  14.73         90.96  2.38         +0.27
  CRL                            90.69  14.73         89.89  1.05         -0.80
ResNet18
  PIY [YanMIPR2020]              91.78  11.18         91.23  6.11         -0.55
  CACP [CACP2021IEEESPL]         93.02  11.68         92.03  3.50         -0.99
  PSLR [GurevinArxiv2020]        93.33  11.68         90.37  1.34         -2.96
  PCNN [ZhanhongPCNNDAC2020]     96.58  11.20         96.38  3.80         -0.20
  CRL                            91.82  11.68         92.09  2.91         +0.27
  CRL                            91.82  11.68         90.97  2.52         -0.85
ResNet50
  AMC [he2018amc]                93.53  25.56         93.55  15.34        +0.02
  CRL                            92.97  25.56         93.60  12.34        +0.63
4 Conclusions
We propose a novel framework for neural network pruning via constrained reinforcement learning that allows respecting budgets on arbitrary, possibly non-differentiable, functions. Ours is a PPO-Lagrangian approach that incorporates budget constraints by constructing a trust region containing all policies that respect the constraints. Our experiments show that the proposed CRL strategy significantly outperforms state-of-the-art methods in producing small and compact networks while maintaining the accuracy of the unpruned baseline architecture. Specifically, our method removes a large fraction of the parameters without incurring any significant loss in performance.