Deep neural networks typically have large memory and compute requirements, making it difficult to deploy them on small devices such as mobile phones and tablets [KimICLR2016]. These cumbersome networks are pruned by removing their redundant weights, layers, filters, and blocks [Tan2019EfficientNetRM, he2018amc, lemaire2019structured, YanMIPR2020, davis2020state, UmairICIP2021]. Neural network pruning strategies can be grouped into three categories: i) offline pruning, ii) online pruning, and iii) pruning via reinforcement learning.
Offline pruning requires multiple iterations of training, pruning, and fine-tuning. Magnitude-based pruning [Han2015nips] works on the ‘magnitude equals salience’ principle, while [lecun1990obd, LebedevCVPR2016] use the second derivative of the weights to rank connections. More recent approaches suggest look-ahead pruning strategies [lee2021lookahead, YanMIPR2020], where the magnitudes of the neighboring connected layers are also considered. In [dong2017layerobd], layers are pruned independently to speed up the pruning process. Instead of reducing the original, deep network (called the “teacher”) by pruning connections, knowledge distillation [hinton2015distilling] trains another, smaller network (called the “student”) to mimic the output of the teacher network. However, this requires designing an ad hoc student network, which is a tedious task.
Online pruning: A more recent class of techniques poses pruning as a learning problem by introducing a mask vector that acts as a gate or indicator function to turn on/off a particular component (connection, filter, layer, or block) [li2017filter, HuangECCV2018, ZhanhongPCNNDAC2020, UmairICIP2021]. The mask vector can be treated as a trainable parameter and optimized directly through gradient descent, or it can be obtained via an auxiliary neural network [lin2017runtime, he2018amc]. Pruning via Surrogate Lagrangian Relaxation (P-SLR) [GurevinArxiv2020] utilizes a surrogate gradient algorithm for Lagrangian relaxation [ZhaoSLR1997]. Budget-aware pruning [lemaire2019structured, Enderich_2021_WACV, Tan2019EfficientNetRM] tries to reduce the size of a network so that it respects a specific budget on computational and space complexity; such budgets often introduce arbitrary, non-differentiable functions. However, one shortcoming of current budget-aware techniques is that they require the function on which the budget is imposed to be either differentiable or fully specified [lemaire2019structured]. This is not always possible when, for example, we want to impose budgets on metrics such as inference time. The main question, then, is the following: can we prune neural networks to respect budgets on arbitrary, possibly non-differentiable, functions? One way to solve this problem is to leverage recent techniques in reinforcement learning (RL) [sutton1998rl, reinforcementlearning].
Pruning via Reinforcement Learning: In reinforcement learning [reinforcementlearning], an agent interacts with its environment by taking actions. Each action results in the agent receiving a reward depending on how good or bad the action is. If the agent is trained to predict sparsity as its action, then the accuracy of the resulting model can be used as the reward, yielding RL-based neural network pruning [he2018amc]. Reinforcement learning (RL) [sutton1998rl, reinforcementlearning] is known to optimize arbitrary non-differentiable functions while respecting budgets on computational and space complexity. One such example is AutoML for Model Compression (AMC) [he2018amc], in which an agent is trained to predict the sparsity ratio for each layer. Accuracy is returned as a reward to encourage the agent to build smaller, faster, and more accurate networks. Similarly, Conditional Automated Channel Pruning (CACP) [CACP2021IEEESPL] is another RL-based method that simultaneously produces compressed models under different compression rates through a single layer-by-layer channel pruning process. While these methods account for both sparsity and accuracy, they lack fine-grained control over these constraints.
Our main contributions: In this work, we formulate the problem of pruning in the constrained reinforcement learning (CRL) framework. We propose a general methodology for pruning neural networks so that they respect specified budgets on arbitrary, possibly non-differentiable, functions. Furthermore, we only assume the ability to evaluate these functions for different inputs; hence they do not need to be fully specified beforehand. Our proposed CRL framework outperforms state-of-the-art methods in compressing popular models such as VGG [simonyan2015vgg] and ResNets [he2015resnet] on benchmark datasets.
2 Pruning via Constrained Reinforcement Learning
Constrained reinforcement learning [altman-constrainedMDP] is an extension of the reinforcement learning problem in which agents, in addition to the reward, also receive a cost. The agent’s goal is to maximize its cumulative reward subject to its cumulative cost being less than some pre-defined threshold. One interesting property here is that neither the reward nor the cost function needs to be differentiable or fully specified. The agent simply needs to be fed a scalar reward and a scalar cost value each time it performs an action.
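The objective just described can be written compactly. Following standard CMDP notation [altman-constrainedMDP] (the discount factor $\gamma$, reward $r$, cost $c$, and threshold $\alpha$ are symbols we introduce here):

```latex
\max_{\pi}\;\mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t}\gamma^{t}\, r(s_t, a_t)\right]
\quad\text{subject to}\quad
\mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t}\gamma^{t}\, c(s_t, a_t)\right] \le \alpha .
```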
Let $\theta$ denote the parameters of a neural network. Each element of $\theta$ represents the weight of a single connection. Removing a connection is thus equivalent to multiplying its weight by 0. Let $L(\theta)$ denote the loss function of the neural network. Pruning, in its most general form, tries to find a binary mask $m$ that solves the following optimization:

$$\min_{m}\; L(\theta \odot m) \quad \text{subject to} \quad C(\theta \odot m) \le \alpha,$$

where $\odot$ is the element-wise Hadamard product, $C$ is some arbitrary, possibly non-differentiable function, and $\alpha$ is a known constant. Here $C$ represents the computational and space complexity of the network. For example, if we want to compress the network by at least a given fraction (in terms of the space it occupies), we can let $C$ be the $L_0$-norm of the mask and choose $\alpha$ accordingly. Similarly, if we wanted to optimize for speed, we could set $C$ to be the number of FLOPs the network consumes (or the time it takes) when it is pruned according to $m$.
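As a concrete instance of such a budget function, a space budget via the $L_0$-norm of the mask can be checked as follows (a minimal sketch; the function name and the normalization by total weight count are our illustrative choices):

```python
import numpy as np

def satisfies_space_budget(mask, alpha):
    """Check whether a binary pruning mask respects a space budget.

    C(theta ⊙ m) is taken to be the fraction of weights kept
    (the L0-norm of the mask divided by its length); the constraint
    C <= alpha then bounds the size of the pruned network.
    """
    kept_fraction = np.count_nonzero(mask) / mask.size
    return kept_fraction <= alpha

mask = np.array([1, 0, 0, 1, 0, 0, 0, 1])  # 3 of 8 weights kept
print(satisfies_space_budget(mask, alpha=0.5))  # True: 37.5% kept <= 50%
```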
In our proposed approach, the mask is modeled via the actions performed by an agent interacting with the environment. In the beginning, the environment loads a pre-trained network (e.g., VGG16 [simonyan2015vgg]). The agent is then fed the layer representation of the first convolutional layer in the network (defined in Section 2.1). The agent then specifies an appropriate action. This action is recorded, and the agent is fed the next state. Once the agent has finished predicting its actions for all convolutional layers, the network is pruned according to the agent’s specified actions and fine-tuned on the training set (the same training set that was used to pre-train the network). The accuracy of the fine-tuned network is returned as the agent’s reward along with the cost (the reward and cost at all other time-steps are 0). Figure 1 shows an illustration of our pruning process.
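The interaction loop just described can be sketched as follows; this is a minimal mock in which the layer features, the placeholder policy, and all names are illustrative assumptions, not our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock per-layer states: one feature vector per convolutional layer
# (layer index plus a few layer attributes, as in the state definition).
num_layers = 4
states = [np.array([i, 64.0 * 2**i, 3.0]) for i in range(num_layers)]

def agent_policy(state):
    """Placeholder policy: ignores the state and predicts a
    sparsity ratio in (0, 1) by squashing a Gaussian sample."""
    return float(1.0 / (1.0 + np.exp(-rng.normal())))

# One episode: visit each layer in turn and record the predicted action.
actions = [agent_policy(s) for s in states]

# After the last layer, the network would be pruned with `actions`,
# fine-tuned, and the resulting accuracy returned as the episode reward;
# reward and cost are zero at all intermediate steps.
intermediate_rewards = [0.0] * (num_layers - 1)
print(len(actions), intermediate_rewards)
```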
2.1 Constrained Markov Decision Process
In order to formulate our problem as a constrained RL problem, we first need to define our constrained MDP (CMDP). Consider a neural network with $L$ convolutional layers, followed by a few fully connected layers; we wish to prune this network. Furthermore, let $\theta_i$ denote the parameters corresponding to layer $i$. We define the key components of the CMDP as follows:
State, $s_i$: Each convolutional layer corresponds to a single state. Similar to the scheme in [he2018amc], each of these layers is represented by a tuple whose first entry, $i$, is the index of that layer and whose remaining entries are the attributes of the convolutional layer.
Action, $a_i$: From the action representations, we end up with a mask vector $m_i$ at each layer $i$. The length of this mask vector equals the number of filters in layer $i$. We use $m$ to collectively denote the mask vectors for all layers.
Transition function: The agent always transitions from state $s_i$ to state $s_{i+1}$, so the transitions are fixed and independent of the agent’s actions.
Reward function, $r$: Let $B$ be a batch of input examples (uniformly) sampled from the training dataset. We define the reward as the negative loss, $r = -L(\theta \odot m; B)$, where $L(\theta \odot m; B)$ is the loss function of the network evaluated on the batch $B$.
Cost function, $c$: The cost function is defined as $c = C(\theta \odot m; B)$, where $C(\theta \odot m; B)$ is our constraint function evaluated on the batch $B$.
Budget, $\alpha$: This is the budget on $C$ that we wish our pruned network to respect.
The policy predicts a sparsity ratio for each layer. Filters are then pruned via magnitude-based pruning [Han2015nips] up to the desired sparsity.
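This per-layer pruning step can be sketched as follows (a minimal sketch using the $L_1$ norm of each filter as its salience, per the ‘magnitude equals salience’ principle; names and shapes are illustrative):

```python
import numpy as np

def prune_filters_by_magnitude(weights, sparsity):
    """Zero out the lowest-magnitude filters of one conv layer.

    `weights` has shape (num_filters, in_channels, kh, kw); a filter's
    salience is its L1 norm. Returns the filter mask (1 = keep) and
    the pruned weights.
    """
    num_filters = weights.shape[0]
    num_prune = int(np.floor(sparsity * num_filters))
    saliences = np.abs(weights).reshape(num_filters, -1).sum(axis=1)
    prune_idx = np.argsort(saliences)[:num_prune]  # smallest-norm filters
    mask = np.ones(num_filters)
    mask[prune_idx] = 0.0
    return mask, weights * mask[:, None, None, None]

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))  # 8 filters
mask, w_pruned = prune_filters_by_magnitude(w, sparsity=0.5)
print(int(mask.sum()))  # 4 filters kept
```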
Let $d_a$ be the dimension of the action. Also, let $d_s$ denote the dimension of the state vector (recall that each state vector corresponds to a single layer).
We model our policy $\pi_\theta$ as a (diagonal) multivariate Gaussian distribution $\mathcal{N}(\mu_\theta(s), \sigma^2 I)$. Here $\mu_\theta$ is a neural network with parameters $\theta$, $\sigma$ is a trainable vector, and $I$ is the identity matrix (hence $\sigma^2 I$ is the covariance matrix of the distribution). The network takes as input a state vector provided by the environment and outputs a mean vector of dimension $d_a$. We then simply sample an action from this distribution and feed it to the environment. In parallel, we also train two other neural networks: a value function for the reward and, defined analogously, a value function for the cost. Recall that having these value networks helps us reduce variance.
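A minimal sketch of such a diagonal Gaussian policy, assuming a tiny stand-in network for the mean (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 1

# A one-hidden-layer network standing in for the mean network mu_theta.
W1 = rng.normal(scale=0.1, size=(16, state_dim))
W2 = rng.normal(scale=0.1, size=(action_dim, 16))
log_std = np.zeros(action_dim)  # trainable; sigma^2 I is the covariance

def sample_action(state):
    """Sample a ~ N(mu_theta(state), sigma^2 I)."""
    mean = W2 @ np.tanh(W1 @ state)
    std = np.exp(log_std)
    return mean + std * rng.normal(size=action_dim)

a = sample_action(np.array([0.0, 1.0, -1.0]))
print(a.shape)
```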
We initialize our policy and value networks randomly and collect data from the environment. Each data point is essentially a tuple $(s_t, a_t, r_t, c_t, s_{t+1})$, where $s_t$ and $s_{t+1}$ are the states at time $t$ and $t+1$ respectively, $a_t$ is the action taken at time $t$, and $r_t$ and $c_t$ are the reward and cost received as a consequence of taking action $a_t$. We use this dataset to update our parameters. Furthermore, we initialize our Lagrange multiplier with a constant value and also update it using this dataset. This entire process is repeated until convergence.
Recall that at each iteration, we are interested in optimizing the following (here we decompose the expectation over trajectories into expectations over states and actions):

$$J_{\mathrm{LAG}}^{\mathrm{PPO}}(\lambda, \pi_\theta) = \sum_{t=1}^{T} \mathbb{E}_{s_t, a_t \sim \mathcal{D}}\!\left[ \frac{\pi_\theta(s_t, a_t)}{\bar{\pi}_\theta(s_t, a_t)}\, J^r(s_t, a_t, s_{t+1}) - \lambda \left( J^c(s_t, a_t, s_{t+1}) - \alpha \right) \right],$$

where $\alpha$ is the budget and the losses are:
All parameters are updated via gradient steps using their respective learning rates $\eta$. Specifically, the policy network is updated by gradient ascent on the Lagrangian,

$$\theta \leftarrow \theta + \eta_\theta \nabla_\theta J_{\mathrm{LAG}}^{\mathrm{PPO}}(\lambda, \pi_\theta),$$

and the Lagrange multiplier by gradient ascent on the constraint violation, clipped at zero to remain a valid multiplier:

$$\lambda \leftarrow \max\!\left(0,\; \lambda + \eta_\lambda \left( J^c - \alpha \right)\right).$$

Furthermore, we define the loss for the reward value function network $V^r_\phi$ as the squared error between its prediction and the observed reward return $R_t$,

$$L(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[ \left( V^r_\phi(s_t) - R_t \right)^2 \right],$$

and update its parameters as $\phi \leftarrow \phi - \eta_\phi \nabla_\phi L(\phi)$. Similarly, the loss for the cost value function network $V^c_\psi$ is defined with respect to the observed cost return $C_t$,

$$L(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[ \left( V^c_\psi(s_t) - C_t \right)^2 \right],$$

and its parameters are updated as $\psi \leftarrow \psi - \eta_\psi \nabla_\psi L(\psi)$.
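The Lagrange multiplier update above amounts to projected gradient ascent; a one-line sketch (the function name is ours, and clipping at zero follows standard Lagrangian methods):

```python
def update_lagrange_multiplier(lmbda, avg_cost_return, alpha, lr):
    """One ascent step on lambda for the Lagrangian J^r - lambda*(J^c - alpha).

    lambda grows when the cost constraint is violated (J^c > alpha) and
    shrinks otherwise; it is clipped at zero to stay a valid multiplier.
    """
    return max(0.0, lmbda + lr * (avg_cost_return - alpha))

# Constraint violated (0.8 > 0.5) -> multiplier increases.
print(update_lagrange_multiplier(1.0, avg_cost_return=0.8, alpha=0.5, lr=0.1))
```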
3.1 Experimental Setup
We evaluated our approach on the CIFAR-10 [krizhevsky2009learning] dataset using ResNet18 [he2015resnet] and variants of the VGG [simonyan2015vgg] network. Training was performed using the Adam optimizer [kingma2014method]. Experiments were conducted on VGG11, VGG16, VGG19, and ResNet18. For each model, we ran the experiments using different budget values. Initially, the policy network was trained for a fixed number of iterations. The policy network then pruned the original model by predicting the sparsity ratio for every convolutional layer using the PPO-Lagrangian algorithm. The Lagrange multiplier was initialized with a fixed value. Once the network was pruned, it was fine-tuned for a certain number of iterations. Since fine-tuning is computationally expensive, we adopted a fine-tuning schedule with hyperparameter values of 0, 32, and 128 iterations; that is, we fine-tune less in the beginning and more towards the end. In practice, we normalize our rewards and cost values with a running mean and standard deviation that is continually updated as more data is collected. Furthermore, we also normalize the state vector in a similar way to improve stability.
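The running normalization mentioned above can be implemented with Welford's online algorithm; a minimal sketch (the class name and epsilon choice are illustrative):

```python
class RunningNormalizer:
    """Normalize a stream of scalars with a running mean/std (Welford)."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        # Welford's online update of mean and sum of squared deviations.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count, 1)
        return (x - self.mean) / ((var + self.eps) ** 0.5)

norm = RunningNormalizer()
for r in [1.0, 2.0, 3.0, 4.0]:
    norm.update(r)
print(norm.mean)  # 2.5
```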
3.2 Ablative Study
To demonstrate the efficacy of the proposed constrained RL method, we performed experiments on VGG11, VGG16, and VGG19 [simonyan2015vgg] and compared the results with magnitude-based pruning. We trained each of these networks on the CIFAR-10 dataset [krizhevsky2009learning]. Pretrained VGG models were used to train the policy network for a fixed number of iterations. During pruning, the models were fine-tuned according to the fine-tuning schedule. We experimented with two budget values per network; increasing the budget value decreases the sparsity of the pruned network. The ablative study showed that the proposed constrained RL method is significantly more effective than magnitude-based pruning (see Table 1). Our method achieved higher accuracy than magnitude-based pruning in all experiments. In fact, it even outperformed the unpruned network in 4 out of 6 cases.
|Sparsity (%)||Acc. (%)||Sparsity (%)||Acc. (%)|
3.3 Comparison with State-of-the-art
To prove the effectiveness of the proposed constrained RL approach, we also conducted experiments with ResNet18, VGG16, and ResNet50 to compare our results with state-of-the-art methods on the CIFAR-10 dataset.
In the case of VGG16, we compared our method with two state-of-the-art methods, namely Conditional Automated Channel Pruning (CACP) [CACP2021IEEESPL] and GhostNet [HanGhostNet2020CVPR]. Our pruned VGG16 model performed better than both. Note that under the looser budget, our pruned model contained 2.38M parameters, fewer than the pruned models of both CACP and GhostNet, which contained 4.41M and 3.30M parameters, respectively. Despite this, our pruned VGG16 still produced an accuracy change better than both state-of-the-art methods and our baseline unpruned VGG16 model. Moreover, under the tighter budget, our pruned model contained only 1.05M parameters, yet had an accuracy change similar to that of GhostNet, which has 3.30M parameters.
Table 2 also shows the comparison of the proposed method with four state-of-the-art methods on the ResNet18 architecture and the CIFAR-10 dataset. The state-of-the-art methods include Prune it Yourself (PIY) [YanMIPR2020], Conditional Automated Channel Pruning (CACP) [CACP2021IEEESPL], Pruning via Surrogate Lagrangian Relaxation (P-SLR) [GurevinArxiv2020], and PCNN [ZhanhongPCNNDAC2020]. It can be seen that the proposed method outperformed all of these methods in terms of accuracy, and three of the four in terms of compression. Table 2 also shows a comparison of the proposed method with the state-of-the-art AMC [he2018amc] method on ResNet50; our method outperforms it.
|Method||Acc. (%)||Params (in millions)||Acc. (%)||Params (in millions)||Acc. change (%)|
|Ghost Net [HanGhostNet2020CVPR]||93.60||14.73||92.90||3.30||-0.70|
We propose a novel framework for neural network pruning via constrained reinforcement learning that allows respecting budgets on arbitrary, possibly non-differentiable functions. Ours is a PPO-Lagrangian approach that incorporates budget constraints by constructing a trust region containing all policies that respect the constraints. Our experiments show that the proposed CRL strategy significantly outperforms state-of-the-art methods in producing small and compact networks while maintaining the accuracy of the unpruned baseline architecture. Specifically, our method removes a large fraction of the parameters without incurring any significant loss in performance.