1 Introduction
Recent advances in language modeling have led to remarkable improvements on a variety of natural language tasks. These models, however, have grown increasingly large (Dai et al., 2019), rendering them slow and costly. Through model compression, we aim to reduce this overhead and to better understand the role of model capacity in language models.
A common approach to model compression is weight pruning (Han et al., 2015; Zhu and Gupta, 2017), in which model weights are progressively removed, resulting in sparse matrices across the network. Earlier work focuses mostly on unstructured pruning, where weights are pruned individually (Narang et al., 2017a; Zhu and Gupta, 2017). While this method is effective, it results in unstructured sparse matrices that are difficult to support on common hardware (Han et al., 2016), making it challenging to obtain inference speedups despite a significant reduction in model size.
On the other hand, structured pruning (Narang et al., 2017b; Wen et al., 2017; Cao et al., 2019; Yao et al., 2019) imposes highly structured sparse weight matrices that can either directly use optimized dense linear algebra primitives or admit efficient implementations (Gray et al., 2017; Yao et al., 2019). These techniques lead to significant speedups, but tend to give lower performance than unstructured pruning at the same parameter budget (Yao et al., 2019), because they impose larger constraints on the pruning process.
To alleviate these constraints, we propose a novel structured pruning technique based on low-rank factorization and $\ell_0$ norm regularization (Louizos et al., 2017). The low-rank factorization allows us to retain the dense structure of the matrices, while the $\ell_0$ regularization relaxes the constraints imposed by structured pruning by allowing the network to choose which weights to remove. We factorize each weight matrix into the product of two smaller matrices and place a diagonal mask between them. We prune the mask during training via $\ell_0$ regularization, and use an augmented Lagrangian approach inspired by Bastings et al. (2019) to control the final sparsity level of the model. Our method, which we refer to as FLOP (Factorized L0 Pruning), is generic and can be applied to any matrix multiplication.
Experimental results on language modeling and language understanding tasks with recurrent and Transformer (Vaswani et al., 2017) architectures indicate that our method either outperforms or matches the performance of state-of-the-art unstructured pruning (Zhu and Gupta, 2017), while also providing up to 2x speedup at inference. Our results also demonstrate that pruning a larger model yields much better performance than training a smaller model from scratch, shedding light on the role of model capacity in language modeling.
2 Related Work
Model compression has three main categories: weight pruning (Narang et al., 2017a; Zhu and Gupta, 2017), knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015), and quantization (Han et al., 2015). Our work focuses on weight pruning and is compatible with these other methods.
Most previous work considers only unstructured pruning, based on weight magnitude (Zhu and Gupta, 2017; Frankle and Carbin, 2019) or on variational dropout (Molchanov et al., 2017; Gale et al., 2019). Our method instead prunes weights in a structured manner. Louizos et al. (2017) propose a differentiable relaxation of the $\ell_0$ regularization. We modify this method by first factorizing the weight matrices, and second by using an Augmented Lagrangian method to control and anneal the target sparsity. Other works also attempt structured pruning but do not consider the $\ell_0$ regularization approach (Narang et al., 2017b; Wen et al., 2017; Cao et al., 2019; Yao et al., 2019). Voita et al. (2019) also use $\ell_0$ regularization, but prune full attention heads in Transformer models on machine translation benchmarks. Our method generalizes to any matrix multiplication through factorization, and leverages Augmented Lagrangian methods to reach the target sparsity.
Recently there have been efforts to compress the BERT model on downstream tasks, for instance via knowledge distillation (Chia et al., 2019). We show that weight pruning is also a viable option, and leave combining these different methods to future work.
3 Method
We begin by formulating model pruning as an end-to-end learning problem, following the prior work of Louizos et al. (2017). In the subsequent sections, we introduce two novel revisions of this method, providing improved pruning performance and explicit control over the model size after pruning.
3.1 Pruning via $\ell_0$ norm regularization
Consider a given neural network model $f(\cdot;\boldsymbol{\theta})$ parameterized by $\boldsymbol{\theta} = \{\theta_j\}_{j=1}^{n}$, where each $\theta_j$ represents an individual parameter weight or a block of weights (e.g. a column of a weight matrix) and $n$ denotes the number of blocks. A pruning strategy of the model can be parameterized by introducing additional binary variables $\mathbf{z} = \{z_j\}_{j=1}^{n}$ such that $z_j \in \{0, 1\}$ and
$$\tilde{\boldsymbol{\theta}} = \{\tilde{\theta}_j\},\qquad \tilde{\theta}_j = \theta_j\, z_j.$$
Here $\tilde{\boldsymbol{\theta}}$ denotes the set of model parameters after pruning, and its $\ell_0$ norm, $\|\tilde{\boldsymbol{\theta}}\|_0 = \sum_{j=1}^{n} z_j$, measures the effective size of the pruned model.
The choice of binary variables $\mathbf{z}$ can be regulated by some prior distribution and optimized given the training data. That is, let $q(\mathbf{z})$ be the density function of the learnable prior of $\mathbf{z}$. The optimization objective during training can be formulated as minimizing the expected training loss
$$\min_{\boldsymbol{\theta}, q}\ \mathbb{E}_{\mathbf{z}\sim q}\left[\frac{1}{D}\sum_{i=1}^{D}\mathcal{L}\big(x_i, y_i; \tilde{\boldsymbol{\theta}}\big) \;+\; \lambda\,\|\tilde{\boldsymbol{\theta}}\|_0\right] \tag{1}$$
where $\{x_i, y_i\}_{i=1}^{D}$ are training examples, $\mathcal{L}$ is the training loss function, and $\lambda > 0$ is a constant hyperparameter for the $\ell_0$ norm regularization encouraging the model to be sparse. In practice, optimizing this objective directly is intractable due to the discrete nature of $\mathbf{z}$ and the exponential number of pruning choices. The key to the method of Louizos et al. (2017) is a reparameterization trick that makes $\mathbf{z}$ differentiable and jointly trainable with the model parameters $\boldsymbol{\theta}$. Specifically, the random variables $\mathbf{z}$ are relaxed to continuous variables distributed within the interval $[0, 1]$. In addition, instead of learning the probability density function $q(\mathbf{z})$, the reparameterization trick learns the inverse of the cumulative distribution function (CDF). Note that if $G(\cdot)$ is the inverse CDF for a variable $z$, then $z$ can easily be sampled by first sampling $u \sim \mathcal{U}(0, 1)$ and computing $z = G(u)$. Assuming the inverse CDF is parameterized by learnable parameters $\boldsymbol{\alpha}$ and the function $G(\cdot;\boldsymbol{\alpha})$ is differentiable, we obtain an overall end-to-end learning objective
$$\min_{\boldsymbol{\theta}, \boldsymbol{\alpha}}\ \mathbb{E}_{\mathbf{u}\sim \mathcal{U}(0,1)}\left[\frac{1}{D}\sum_{i=1}^{D}\mathcal{L}\big(x_i, y_i; \tilde{\boldsymbol{\theta}}\big) \;+\; \lambda\,\|\tilde{\boldsymbol{\theta}}\|_0\right],\qquad \mathbf{z} = G(\mathbf{u};\boldsymbol{\alpha}) \tag{2}$$
where $\mathbf{u} = \{u_1, \cdots, u_n\}$ denotes the i.i.d. samples from the uniform distribution. Since $\mathbf{z}$ is now the output of the parameterized function $G(\cdot;\boldsymbol{\alpha})$ and is used as an intermediate representation for the neural network (with $\tilde{\boldsymbol{\theta}} = \boldsymbol{\theta} \odot \mathbf{z}$), gradient-based optimization can perform gradient updates for both $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$. Following the work of Louizos et al. (2017), we choose the Hard Concrete distribution for the random variables $\mathbf{z}$. The inverse CDF of this distribution is defined as follows:
$$u \sim \mathcal{U}(0, 1),\qquad s = \sigma\big(\log u - \log(1-u) + \alpha\big),$$
$$\bar{s} = s \times (r - l) + l,\qquad z = \min\big(1, \max(0, \bar{s})\big)$$
where $l < 0$ and $r > 1$ are two constants used to 'stretch' the sigmoid outputs $s$ into the interval $(l, r)$, and the final outputs $z$ are rectified into $[0, 1]$. The stretch-and-rectify process has the effect of assigning a significant amount of probability mass to the integer values $\{0, 1\}$, which makes it a good relaxation of the binary (Bernoulli) distribution. During training, we sample $\mathbf{u}$ and compute $\mathbf{z}$ and the loss $\mathcal{L}(\cdot)$ for each training batch. The expected $\ell_0$ norm regularization can be computed separately in closed form
$$\mathbb{E}\big[\|\tilde{\boldsymbol{\theta}}\|_0\big] \;=\; \sum_{j=1}^{n} \mathbb{P}(z_j > 0) \;=\; \sum_{j=1}^{n} \sigma\left(\alpha_j - \log\frac{-l}{r}\right) \tag{3}$$
which is differentiable as well.
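As a concrete illustration, the Hard Concrete sampling procedure and the closed-form penalty of Eq. (3) can be sketched in NumPy. This is a minimal sketch under assumptions: the stretch constants $l = -0.1$, $r = 1.1$ and a temperature of 1 are common defaults from Louizos et al. (2017), not values confirmed by this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hard_concrete(alpha, l=-0.1, r=1.1, rng=None):
    """Sample relaxed gates z in [0, 1] from the Hard Concrete distribution.

    alpha: per-gate location parameters (learned).
    l < 0 and r > 1 are the 'stretch' constants (assumed defaults).
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1 - 1e-6, size=alpha.shape)  # u ~ U(0, 1)
    s = sigmoid(np.log(u) - np.log(1 - u) + alpha)     # concrete sample
    s_bar = s * (r - l) + l                            # stretch into (l, r)
    return np.clip(s_bar, 0.0, 1.0)                    # rectify into [0, 1]

def expected_l0(alpha, l=-0.1, r=1.1):
    """Closed-form E[sum_j 1{z_j > 0}] = sum_j sigmoid(alpha_j - log(-l/r)), Eq. (3)."""
    return sigmoid(alpha - np.log(-l / r)).sum()
```

Because `expected_l0` is a smooth function of `alpha`, the penalty can be minimized by ordinary gradient descent alongside the training loss.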
3.2 Structured pruning using factorization
A key choice is how we define the parameter blocks $\{\theta_j\}$ to achieve the most effective pruning results. One obvious method is to allow each individual parameter weight to be pruned independently. While this method often retains very strong performance after pruning, it produces unstructured sparse parameter matrices, which require custom hardware or sparse linear algebra primitives in order to achieve a decent computation speedup.
Recent work has adopted structured pruning as a remedy. Consider a fully-connected layer which performs the multiplication $W\mathbf{x}$ for an input $\mathbf{x} \in \mathbb{R}^{d}$. One popular method adds the sparsity variables as a diagonal matrix in the multiplication, i.e., $W\,\mathrm{diag}(\mathbf{z})\,\mathbf{x}$, where $d$ denotes the number of rows of $\mathbf{x}$. This effectively removes the subset of columns of $W$ whose column indices $k$ have $z_k = 0$. In practice, this structured pruning method can directly utilize the same dense linear algebra primitives (e.g. dense matrix multiplication) that are used in unpruned models. It also produces significant speedups at both training and inference time, by selecting a small subset of columns and performing multiplications with much smaller matrices. However, one limitation is that this structured pruning method tends to produce lower performance than its unstructured counterpart.
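To make the column-removal effect concrete, here is a small NumPy sketch (the matrix sizes and mask values are arbitrary, chosen purely for illustration): applying the diagonal mask is numerically equivalent to a dense multiply over only the surviving columns.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))          # weight matrix
x = rng.standard_normal(6)               # input vector
z = np.array([1., 0., 1., 1., 0., 1.])   # hypothetical binary pruning mask

masked = W @ np.diag(z) @ x              # structured pruning via diagonal mask
keep = z != 0
compact = W[:, keep] @ x[keep]           # equivalent, smaller dense multiply
assert np.allclose(masked, compact)
```

The compact form is what makes structured pruning hardware-friendly: it is still a plain dense matrix-vector product, only smaller.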
We propose instead a low-rank factorization of the weight matrix, and optimize to prune rank-1 components of the factorization. That is, we reparameterize and factorize the matrix $W$ into the product of two smaller matrices $P$ and $Q$, i.e., $W = PQ$. Let $r$ be the number of columns of $P$ (or equivalently the number of rows of $Q$), and let $\mathbf{p}_k$ and $\mathbf{q}_k$ be the $k$-th column of $P$ and the $k$-th row of $Q$ respectively. Since $W$ is the sum of $r$ rank-1 components $\mathbf{p}_k \mathbf{q}_k$, we achieve structured pruning by introducing a pruning variable $z_k$ for each component,
$$W = P\, G\, Q = \sum_{k=1}^{r} z_k \times \big(\mathbf{p}_k\, \mathbf{q}_k\big)$$
where $G = \mathrm{diag}(z_1, \cdots, z_r)$ is again the diagonal matrix of pruning variables. Intuitively, learning the factorization has the potential of keeping the most effective rank-1 components, thereby better preserving model performance. In addition, after training, only the columns and rows corresponding to nonzero diagonal values need to be stored, resulting in much smaller (but still dense) matrices $P$ and $Q$. The nonzero values of $G$ can be absorbed into either $P$ or $Q$. The computation thus boils down to a dense multiplication of two smaller matrices at inference time, maximizing efficiency on current hardware. Unlike unstructured pruning, we need not store the indices of the sparse weights, resulting in better memory savings.
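The post-training compaction described above can be sketched as follows; the helper name `factored_prune` is ours, and we assume the gates for pruned components have converged to exact zeros.

```python
import numpy as np

def factored_prune(P, Q, z):
    """Keep only the rank-1 components with nonzero gates.

    W = P @ diag(z) @ Q is a sum of gated rank-1 terms z_k * p_k q_k.
    Zeroed components are dropped, and the surviving gate values are
    absorbed into P, leaving two smaller dense matrices.
    """
    keep = z != 0
    P_small = P[:, keep] * z[keep]   # absorb nonzero gates into P's columns
    Q_small = Q[keep, :]
    return P_small, Q_small

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 5))
Q = rng.standard_normal((5, 8))
z = np.array([1.0, 0.0, 0.7, 0.0, 1.0])  # hypothetical converged gates

P_s, Q_s = factored_prune(P, Q, z)
assert P_s.shape == (8, 3) and Q_s.shape == (3, 8)
assert np.allclose(P @ np.diag(z) @ Q, P_s @ Q_s)
```

Note that no sparse indices need to be stored: the pruned layer is just an ordinary pair of smaller dense matrices.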
3.3 Sparsity control using Augmented Lagrangian
The training objective (2) contains an $\ell_0$ regularization term to promote weight pruning. One limitation of this regularization is the lack of effective control over the size of the pruned model. For instance, we observe that training runs with the same $\lambda$ can converge to very different model sizes when using slightly different learning rates or pruning schedules. This can be problematic because a desired model size or parameter budget is often required in real-world applications.
We make use of an Augmented Lagrangian method to overcome this limitation. Let $t$ be the target model size and $s(\boldsymbol{\alpha})$ be the expected model size determined by the Hard Concrete parameters $\boldsymbol{\alpha}$. Note that $s(\boldsymbol{\alpha})$ can be computed based on Eq. (3), by multiplying $\mathbb{P}(z_j > 0)$ with the size of the $j$-th parameter block. The Augmented Lagrangian method imposes the equality constraint $s(\boldsymbol{\alpha}) = t$ by introducing a violation penalty,
$$g(\lambda_1, \lambda_2, \boldsymbol{\alpha}) = \lambda_1 \cdot \big(s(\boldsymbol{\alpha}) - t\big) + \lambda_2 \cdot \big(s(\boldsymbol{\alpha}) - t\big)^2$$
where $\lambda_1, \lambda_2 \in \mathbb{R}$ are two Lagrangian multipliers that are jointly updated during training. The overall training optimization is an adversarial game,
$$\max_{\lambda_1, \lambda_2}\ \min_{\boldsymbol{\theta}, \boldsymbol{\alpha}}\ \mathbb{E}_{\mathbf{u}}\left[\frac{1}{D}\sum_{i=1}^{D}\mathcal{L}\big(x_i, y_i; \tilde{\boldsymbol{\theta}}\big)\right] + g(\lambda_1, \lambda_2, \boldsymbol{\alpha}).$$
The updates of $\lambda_1$ and $\lambda_2$ would always increase the training loss unless the equality constraint is met, which in our case gives us the desired model size.
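A minimal sketch of the multiplier dynamics (the function names, scalar formulation, and shared learning rate are our assumptions, not details from the paper): the penalty vanishes only when the expected size hits the target, and gradient ascent on $\lambda_1, \lambda_2$ grows the penalty whenever it does not.

```python
def lagrangian_penalty(expected_size, target_size, lam1, lam2):
    """Violation penalty g = lam1 * (s - t) + lam2 * (s - t)^2."""
    diff = expected_size - target_size
    return lam1 * diff + lam2 * diff ** 2

def update_multipliers(lam1, lam2, expected_size, target_size, lr):
    """Gradient-ascent step on the multipliers.

    The gradients of g w.r.t. lam1 and lam2 are diff and diff**2,
    so ascent increases the penalty until the constraint s = t holds.
    """
    diff = expected_size - target_size
    return lam1 + lr * diff, lam2 + lr * diff ** 2
```

In training, the model parameters take a descent step on loss plus penalty while the multipliers take an ascent step, so the equilibrium of the game satisfies the size constraint.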
3.4 Implementation details
At the start of pruning, we gradually increase the target sparsity at a linear rate. That is, given the desired sparsity $t$, we set the target sparsity at the $k$-th pruning iteration to
$$t_k = \min\left(1, \frac{k}{m}\right)\cdot t$$
where $m$ is a hyperparameter specifying the number of sparsity annealing steps.
During training, we compute the gradients with respect to the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$, as well as the Lagrangian multipliers $\lambda_1, \lambda_2$. We perform joint gradient updates for the parameters and the Lagrangian multipliers at every iteration, but use and tune a separate learning rate for the Lagrangian multipliers. For each training batch, we sample the pruning mask $\mathbf{z}$ and share it across the training examples within the batch. Since the pruning mask is shared, we can select the parameters that are active for the current batch and compute smaller matrix multiplications in the forward and backward passes. This can result in a training speedup when $\mathbf{z}$ becomes sparse.
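The linear annealing schedule above can be written as a one-liner (the function name is ours):

```python
def target_sparsity(step, final_sparsity, anneal_steps):
    """Linearly anneal the target sparsity: t_k = min(k/m, 1) * t."""
    return final_sparsity * min(step / anneal_steps, 1.0)
```

For example, with a final sparsity of 0.9 and 64K annealing steps, the target at step 32K is 0.45, and it stays at 0.9 for any step past 64K.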
4 Results
Here we comprehensively benchmark the performance of our method on language modeling and classification tasks with different neural network architectures. Since FLOP targets the weight matrices of fully-connected (FC) layers, it in principle supports any architecture with FC layers. All training is performed using NVIDIA V100 SXM2 GPUs. All inference timing measurements are done using a single thread on an Intel Xeon E5-2686 CPU @ 2.30GHz.
Table 1 (a): Comparison of FLOP and baselines on enwik8 using SRU (bits-per-character; lower is better).

| Parameters | FLOP | AGP (unstructured) | AGP (structured) | Dense Model |
|---|---|---|---|---|
| 35M (100%) | 1.24 | – | – | – |
| 11M (30%) | 1.25 | 1.28 | 1.33 | 1.36 |
| 7.6M (20%) | 1.27 | 1.30 | 1.36 | 1.40 |
| 5.9M (15%) | 1.29 | 1.34 | 1.39 | 1.43 |
| 4.2M (10%) | 1.33 | 1.39 | 1.46 | 1.48 |
Table 1 (b): Comparison of FLOP and baselines on enwik8 using Transformer-XL (bits-per-character; lower is better).

| Parameters | FLOP | AGP (unstructured) | Dense Model |
|---|---|---|---|
| 41M (100%) | 1.10 | – | – |
| 8.4M (20%) | 1.16 | 1.17 | 1.24 |
| 5.3M (10%) | 1.19 | 1.17 | 1.36 |
Table 2: Inference time and speedup of the pruned SRU models.

| Parameters | Time (s) | Speedup |
|---|---|---|
| 35M (100%) | 0.39 | 1x |
| 11M (30%) | 0.23 | 1.7x |
| 7.6M (20%) | 0.21 | 1.9x |
| 5.9M (15%) | 0.20 | 2.0x |
| 4.2M (10%) | 0.18 | 2.2x |
4.1 Characterlevel language modeling
Dataset
We use the enwik8 dataset, one of the standard benchmarks for characterlevel language modeling. The dataset contains 100M bytes of data taken from Wikipedia. Following standard practice, we use the first 90M as training data and the remaining 10M for evaluation, split evenly as the development and test sets.
Setup
We evaluate FLOP and all baseline methods on two recent neural network architectures: SRU (Lei et al., 2018) and Transformer-XL (Vaswani et al., 2017; Dai et al., 2019). We extend their implementations to support structured pruning. We reuse the published training configurations and only tune the hyperparameters of the pruning methods.
We experiment with the following baseline methods:

- Dense Model: directly trains dense (unpruned) models of smaller sizes.
- AGP (unstructured): one of the state-of-the-art approaches, which gradually prunes individual parameters based on weight magnitude (Zhu and Gupta, 2017).
- AGP (structured): the original AGP method prunes individual weights. We also experiment with a variant closer to our method, obtained by factorizing $W = PGQ$ and controlling the sparsity of the diagonal matrix $G$ with AGP.
We use the existing implementation provided by the public Nervana Distiller library (Zmora et al., 2018) for the AGP method. We conduct ablation analyses and report the results of additional pruning variants of FLOP in Section 5.
SRU results
Following the practice of Lei et al. (2018), we train a 6-layer SRU model using a batch size of 64 and an unroll length of 256. We use a hidden size of 3056 and set the initial rank of the parameter matrices to 512; that is, we replace each weight matrix in SRU with an explicit factorization $W = PQ$ with an inner dimension of 512. We train the model without pruning for 30 epochs as a warmup, and then prune for a maximum of 100 epochs.
Table 3: Inference time and speedup of the pruned Transformer-XL models.

| Parameters | Time (s) | Speedup |
|---|---|---|
| 41M (100%) | 1.33 | 1x |
| 8.4M (20%) | 0.87 | 1.5x |
| 5.3M (10%) | 0.82 | 1.6x |
Table 1 (a) presents the results of FLOP as well as the baseline methods. The results conform to our expectations and to the results reported in previous work: pruning a large model is consistently better than training a small dense model from scratch. Furthermore, FLOP exceeds the performance of the unstructured AGP method at all sparsity levels tested. For instance, we lose only 0.01 bits-per-character (BPC) (less than 1% relative) when keeping 30% of the parameters, while the AGP baseline loses 0.04 BPC.
Table 4: Finetuning results of the RoBERTa base model on downstream classification tasks.

| Parameters | SST-2 | MRPC | STS-B | QNLI | Average |
|---|---|---|---|---|---|
| 125M (100%) | 92.43 | 90.9 | 90.22 | 89.77 | 90.83 |
| 80M (65%) | 92.09 | 88.61 | 88.18 | 89.05 | 89.48 |
FLOP can easily achieve significant computation speedup because of its structured pruning. During training, FLOP obtains a training speedup ranging from 1.6x to 2.4x for the sparsity levels tested. As shown in Table 2, similar speedups are observed at inference time on CPUs: 1.7x at 70% sparsity and 2.2x at 90% sparsity. In contrast, the computation over unstructured sparse matrices is harder to optimize. For models obtained using unstructured AGP, we experimented with the sparse matrix multiplication routine provided in PyTorch (Paszke et al., 2017) and with a recent linear algebra compiler (Kjolstad et al., 2017), but were unable to achieve a speedup. We further examine the breakdown of inference execution time in Figure 1. The computation of SRU is dominated by two operations: the matrix multiplication and the fused recurrent cell operation. As shown in the figure, the matrix multiplication is the main bottleneck before pruning, while the recurrent cell operation becomes the bottleneck after pruning. Indeed, the matrix multiplication time decreases linearly with the parameter count, highlighting the effectiveness of our structured pruning.
Transformer results
For the Transformer-XL architecture, we use the 12-layer model of Dai et al. (2019), consisting of 41M parameters in total. We introduce pruning for each of the key, query and value matrices in the self-attention layers, as well as in the feed-forward layers. For factorization-based pruning, we choose the starting rank $r$ for each weight matrix such that the total number of multiplications remains the same as in the original unfactored model (in effect, we set $r = \frac{d_1 \times d_2}{d_1 + d_2}$, where $d_1$ and $d_2$ are the dimensions of the original weight matrix). We prune the Transformer-XL model to 80% and 90% sparsity levels. As with the SRU model, for each pruned model we train smaller dense models that match (or exceed) its parameter count, by reducing the number of layers and/or the model/inner dimensions. Again, we use unstructured AGP as an additional baseline.
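The rank choice described above can be checked numerically: with $r = d_1 d_2 / (d_1 + d_2)$, the factored multiplication count $d_1 r + r d_2$ equals the original $d_1 d_2$. The dimensions below are illustrative, not the paper's actual layer sizes.

```python
def matched_rank(d1, d2):
    """Rank r such that d1*r + r*d2 == d1*d2 multiplications."""
    return d1 * d2 / (d1 + d2)

d1, d2 = 300, 1200
r = matched_rank(d1, d2)                 # in practice this would be rounded to an int
assert abs((d1 * r + r * d2) - d1 * d2) < 1e-6
```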
Table 1 (b) shows the pruning results. Again, both pruning methods significantly outperform training small dense models from scratch. Our method achieves results on par with the unstructured pruning baseline, being marginally worse at 90% sparsity but slightly better at 80% sparsity.
As shown in Table 3, our pruned Transformer-XL models achieve 1.5-1.6x inference speedup. The relative gain is smaller than for SRU because matrix multiplication represents only around 40% of the total computation in Transformer-XL inference; the remainder is made up mostly of the softmax, layer normalization and attention computations. As with SRU, we observe linear acceleration of the matrix multiplication due to pruning, but the softmax and other computations eventually dominate the inference time. The breakdown of inference time is shown in Figure 1.

Table 5: Comparison of factorization-based pruning and input feature pruning on enwik8 using SRU. We report BPC at each sparsity level and, in parentheses, the relative loss compared to the corresponding unpruned model.

| Variant | Size | 0% | 70% | 80% | 85% | 90% |
|---|---|---|---|---|---|---|
| Input pruning | 37M | 1.30 | 1.31 (0.8%) | 1.34 (3.2%) | 1.37 (5.4%) | 1.43 (10.0%) |
| Input pruning (large) | 66M | 1.25 | 1.28 (2.4%) | 1.31 (4.8%) | 1.32 (5.6%) | 1.37 (9.6%) |
| FLOP (factorized) | 35M | 1.24 | 1.25 (0.8%) | 1.27 (2.4%) | 1.29 (4.0%) | 1.33 (7.3%) |
4.2 Finetuning BERT on classification tasks
We further demonstrate that our method can also be applied to language model finetuning on downstream tasks. In this experiment, we use the RoBERTa base model (Liu et al., 2019), which has recently achieved state-of-the-art performance across a variety of natural language understanding tasks.
Since the model was pretrained without matrix factorization, we first compute the singular value decomposition of each matrix in the network that we aim to prune, and then introduce the pruning mask between the resulting factored matrices. Note that this procedure temporarily increases the total number of parameters; we always compare the final number of parameters to the initial number before factorization.
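A sketch of this initialization step, assuming a full-rank SVD of each pretrained matrix (the helper name is ours): absorbing the singular values into the left factor gives $W = PQ$ exactly, after which a diagonal pruning mask can be inserted between $P$ and $Q$.

```python
import numpy as np

def svd_factorize(W):
    """Factor a pretrained matrix W into P, Q with a full-rank SVD,
    so that W = P @ Q exactly and a pruning mask can sit between them."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U * s          # absorb singular values into the left factor's columns
    Q = Vt
    return P, Q

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
P, Q = svd_factorize(W)
assert np.allclose(P @ Q, W)   # exact reconstruction before any pruning
```

Ordering the factorization by singular values also means the rank-1 components start out sorted by importance, a natural initialization for the gates.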
Our results are shown in Table 4. We are able to conserve nearly 99% of the performance while reducing the number of parameters by 35%. Our target sparsity level is limited by the fact that the embedding layers make up a significant portion of the remaining parameters. We believe that higher levels of sparsity could be obtained by also factorizing the embedding layer, similar to Lan et al. (2019).
5 Analysis
In this section, we perform an analysis of several aspects of our method.
Factorization
Previous work has shown the effectiveness of $\ell_0$ regularization for pruning input features, an approach that applies the mask directly to the multiplication as $W\,\mathrm{diag}(\mathbf{z})$, where $\mathbf{z}$ is the pruning mask. We hypothesize that pruning the input dimensions is a more restrictive form of pruning, and show that our factorization strategy, $W = PGQ$, generally yields better results. (In addition, pruning the input dimensions directly requires index-select operations on the input at inference time, based on which input dimensions are needed for the current operation; this leads to slower inference.)
To this end, we train 6-layer SRU models without weight matrix factorization and compare their performance against the factorized models. Keeping the total number of parameters similar to the original model gives an unfactorized model with hidden size 1536; the original model has a hidden size of 3056, since low-rank factorization reduces the model size. To avoid an unfair comparison, we also train a larger unfactorized model containing 66M parameters in total. This model obtains 1.25 BPC, on par with the original model used in the previous experiments.
Table 5 summarizes the pruning performance of our factorization method and the previous input pruning method. We show the BPC at different sparsity levels and the relative loss of performance compared to the corresponding unpruned model. These results are consistent with our hypothesis: factorization-based pruning retains relative model performance much more effectively than input feature pruning. Our method also achieves better absolute results while using fewer parameters.
Learning dynamics
Figure 2 demonstrates the training dynamics of the Hard Concrete distribution. We plot the histogram of Hard Concrete parameters $\boldsymbol{\alpha}$ after every few thousand training iterations. A negative value of $\alpha_j$ indicates that the associated parameter is likely to be pruned, while a positive value indicates the opposite; the magnitude of the value reflects the certainty of the pruning decision. As illustrated by the figure, the distribution of $\boldsymbol{\alpha}$ becomes bimodal after the initial exploration: certain parameters within each layer are completely pruned, while others are kept with (almost) absolute certainty. In addition, the dynamics vary across layers. For instance, for SRU the first recurrent layer is pruned more aggressively than the last layer.
Sparsity at different layers
A natural question to ask is how pruning affects different parts of the network. We show in Figure 3 that layers closer to the final output tend to be pruned less aggressively. This effect is clearly visible for the SRU architecture. For the Transformer model, while a downwards trend is also visible, the correlation is not as strong, especially for the self-attention layers.
The variability in the sparsity levels of different layers hints at a strength of the $\ell_0$ regularization method: the network is free to allocate different parameter budgets to different layers. This is in contrast to most other pruning approaches, where the sparsity level of each layer has to be specified manually (Han et al., 2015; He et al., 2018). This could partly explain why our method is able to match or beat magnitude-based baselines in our experiments.
Impact of sparsity annealing
We found annealing the target sparsity to be essential to good performance. Figure 4 shows the BPC for several different numbers of annealing steps. The run with the most annealing steps (64K) exhibits a much smoother sparsity growth curve and a clear improvement in BPC given the slower and smoother sparsification. This fits our intuition: a neural network should be given sufficient time to explore and adjust to an increasing sparsity.
6 Conclusion
In this work, we present a novel structured pruning method based on low-rank factorization and $\ell_0$ regularization. We systematically evaluate the performance of this method on large language models, and show that it can provide significant speedups and compression rates on large state-of-the-art models while losing minimal performance, compared to unstructured magnitude pruning.
This work contributes to reducing the growing overhead of large language models, and shines a light on the role of model capacity in language modeling. In particular, we show that it is possible to build small models of very high performance through compression, which vastly outperform models of the same size trained from scratch. This suggests that the success of large language models is not only due to a higher model capacity but also to better optimization Melis et al. (2018).
References

- Ba and Caruana (2014). Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662.
- Bastings et al. (2019). Interpretable neural predictions with differentiable binary variables. arXiv preprint arXiv:1905.08160.
- Cao et al. (2019). Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 63–72.
- Chia et al. (2019). Transformer to CNN: label-scarce distillation for efficient text classification. arXiv preprint arXiv:1909.03508.
- Dai et al. (2019). Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations.
- Gale et al. (2019). The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.
- Gray et al. (2017). GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224.
- Han et al. (2016). EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254.
- Han et al. (2015). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
- He et al. (2018). AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800.
- Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Kjolstad et al. (2017). The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1 (OOPSLA), pp. 77.
- Lan et al. (2019). ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Lei et al. (2018). Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4470–4481.
- Liu et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Louizos et al. (2017). Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312.
- An augmented Lagrangian approach to constrained MAP inference.
- Melis et al. (2018). On the state of the art of evaluation in neural language models. In International Conference on Learning Representations.
- Molchanov et al. (2017). Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507.
- Narang et al. (2017a). Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119.
- Narang et al. (2017b). Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782.
- Paszke et al. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Voita et al. (2019). Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5797–5808.
- Wen et al. (2017). Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027.
- Yao et al. (2019). Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5676–5683.
- Zhu and Gupta (2017). To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.
- Zmora et al. (2018). Neural Network Distiller.