Structured Pruning of Large Language Models

10/10/2019
by   Ziheng Wang, et al.
MIT
ASAPP Inc.

Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a novel, structured pruning approach based on low-rank factorization and augmented Lagrangian L0 norm regularization. Our structured approach achieves significant inference speedups while matching or outperforming our unstructured pruning baseline at various sparsity levels. We apply our method to state-of-the-art models on the enwik8 dataset and obtain a 1.19 perplexity score with just 5M parameters, vastly outperforming a model of the same size trained from scratch. We also demonstrate that our method can be applied to language model fine-tuning by pruning the BERT model on several downstream classification benchmarks.


1 Introduction

Recent advances in language modeling have led to remarkable improvements on a variety of natural language tasks. These models, however, have grown increasingly large Dai et al. (2019), rendering them slow and costly. Through the use of model compression, we aim to reduce this overhead, and to better understand the role of model capacity in language models.

A common approach to model compression is known as weight pruning Zhu and Gupta (2017); Han et al. (2015). Model weights are progressively removed, resulting in sparse matrices across the network. Earlier work focuses mostly on unstructured pruning, where weights are pruned individually Narang et al. (2017a); Zhu and Gupta (2017). While this method is effective, it results in unstructured sparse matrices that are difficult to support on common hardware Han et al. (2016), making it challenging to obtain inference speedups, despite a significant reduction in model size.

On the other hand, structured pruning Narang et al. (2017b); Wen et al. (2017); Cao et al. (2019); Yao et al. (2019) imposes highly structured sparse weight matrices that can either directly use optimized dense linear algebra primitives or admit efficient implementations Gray et al. (2017); Yao et al. (2019). These techniques lead to significant speedup but tend to give lower performance than unstructured pruning Yao et al. (2019) with the same parameter budget, due to imposing larger constraints on the pruning process.

In order to alleviate these constraints, we propose a novel structured pruning technique based on low-rank factorization and L0 norm regularization Louizos et al. (2017). The low-rank factorization allows us to retain the dense structure of the matrices, while the L0 regularization relaxes the constraints imposed by structured pruning, by allowing the network to choose which weights to remove. We factorize the weight matrices into the product of two smaller matrices and set a diagonal mask between these two matrices. We prune the mask during training via L0 regularization, and use an augmented Lagrangian approach inspired by Bastings et al. (2019) to control the final sparsity level of the model. Our method, which we refer to as FLOP (Factorized L0 Pruning), is generic and can be applied to any matrix multiplication.

Experimental results on language modeling and language understanding tasks with recurrent and Transformer Vaswani et al. (2017) architectures indicate that our method either outperforms or matches the performance of state of the art unstructured pruning Zhu and Gupta (2017), while also providing up to 2x speedup at inference. Our results also demonstrate that pruning larger models yields much higher performance than training a smaller model from scratch, shining a light on the role of model capacity in language modeling.

2 Related Work

Model compression has three main categories: weight pruning Narang et al. (2017a); Zhu and Gupta (2017), knowledge distillation Ba and Caruana (2014); Hinton et al. (2015), and quantization Han et al. (2015). Our work is focused on weight pruning, and is compatible with these other methods.

Most previous work only considers unstructured pruning, either based on magnitude Zhu and Gupta (2017); Frankle and Carbin (2019) or through variational dropout Gale et al. (2019); Molchanov et al. (2017). Our method aims to prune weights in a structured manner. Louizos et al. (2017) proposes a differentiable relaxation of the L0 regularization. We modify this method, first by factorizing the weight matrices, and second by using an Augmented Lagrangian method to control and anneal the target sparsity. Other works also attempt structured pruning but do not consider the L0 regularization approach Narang et al. (2017b); Wen et al. (2017); Cao et al. (2019); Yao et al. (2019). Voita et al. (2019) also uses L0 regularization but prunes full attention heads in Transformer models on machine translation benchmarks. Our method generalizes to any matrix multiplication through factorization and leverages Augmented Lagrangian methods to reach the target sparsity.

Recently, there have been efforts to compress the BERT model on downstream tasks. Such methods include knowledge distillation Chia et al. (2019). We show that weight pruning is also a viable option, and leave it to future work to combine these different methods.

3 Method

We begin by formulating model pruning as an end-to-end learning problem, following the prior work of Louizos et al. (2017). In the subsequent sections, we introduce two novel revisions over this method, providing improved pruning performance and explicitly controlled model size after pruning.

3.1 Pruning via L0 norm regularization

Consider a given neural network model f(·; θ) parameterized by θ = {θ_1, …, θ_n}, where each θ_j represents an individual parameter weight or a block of weights (e.g. a column of a weight matrix) and n denotes the number of blocks. A pruning strategy of the model can be parameterized by introducing additional binary variables z = {z_1, …, z_n} such that z_j ∈ {0, 1} and

    \tilde{\theta} = \theta \odot z, \qquad \tilde{\theta}_j = \theta_j \, z_j .

Here θ̃ denotes the set of model parameters after pruning and its L0 norm, ‖θ̃‖_0 = Σ_{j=1}^{n} z_j, measures the effective size of the pruned model.

The choice of binary variables z can be regulated by some prior distribution and optimized given the training data. That is, let q(z) be the density function of the learnable prior of z. The optimization objective during training can be formulated as minimizing the expected training loss

    \min_{\theta,\, q} \;\; \mathbb{E}_{z \sim q}\left[ \frac{1}{D} \sum_{i=1}^{D} \mathcal{L}\big(x_i, y_i; \tilde{\theta}\big) \;+\; \lambda \,\|\tilde{\theta}\|_0 \right]    (1)

where {x_i, y_i}_{i=1}^{D} are the training examples, L(·) is the training loss function and λ > 0 is a constant hyper-parameter for the L0 norm regularization encouraging the model to be sparse. Note that in practice optimizing this objective is intractable due to the discrete nature of z and the exponential number of choices for z.

The key to the method of Louizos et al. (2017) is a re-parameterization trick that enables z to be differentiable and jointly trained with the model parameters θ. Specifically, the random variables z are relaxed as continuous variables distributed within the interval [0, 1]. In addition, instead of learning the probability density function of z, the re-parameterization trick proposes to learn the inverse of its cumulative density function (CDF). Note that if g(·) is the inverse CDF for a variable z, then z can be easily sampled by first sampling u ~ U(0, 1) and computing z = g(u). Assuming the inverse CDF is parameterized by some learnable parameters α and the function g(·; α) is differentiable, we obtain an overall end-to-end learning objective,

    \min_{\theta,\, \alpha} \;\; \mathbb{E}_{u \sim U(0,1)}\left[ \frac{1}{D} \sum_{i=1}^{D} \mathcal{L}\big(x_i, y_i; \tilde{\theta}\big) \;+\; \lambda \,\|\tilde{\theta}\|_0 \right], \qquad z = g(u; \alpha)    (2)

where u = {u_1, …, u_n} denotes the iid samples from the uniform distribution. Since z is now the output of the parameterized function g(·; α) and is used as an intermediate representation for the neural network (with θ̃ = θ ⊙ z), gradient-based optimization can perform gradient updates for both θ and α.

Following the work of Louizos et al. (2017), we choose the Hard Concrete distribution for the random variables z. The inverse CDF g(·; α) of this distribution is defined as follows:

    u \sim U(0, 1)
    s = \sigma\!\left( \frac{\log u - \log(1-u) + \alpha}{\beta} \right)
    \bar{s} = s \times (r - l) + l
    z = \min\big(1, \max(0, \bar{s})\big)

where l < 0 and r > 1 are two constants used to 'stretch' the sigmoid output s into the interval (l, r), and the final output z is rectified into [0, 1]. The stretch-and-rectify process has the effect of assigning a good amount of probability mass to the integer values {0, 1}, which makes it a good relaxation of the binary (Bernoulli) distribution. During training, we sample u, then compute z and the loss for each training batch. The expected L0 norm regularization can be separately computed via the closed form

    \mathbb{E}\big[\|\tilde{\theta}\|_0\big] \;=\; \sum_{j=1}^{n} P\big(z_j > 0\big) \;=\; \sum_{j=1}^{n} \sigma\!\left( \alpha_j - \beta \log \frac{-l}{r} \right)    (3)

which is differentiable as well.
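The stretch-and-rectify sampling and the closed-form penalty above can be written compactly in PyTorch. The following is a minimal sketch only: the module name HardConcreteGate and the constants l = -0.1, r = 1.1, β = 2/3 are illustrative defaults from the L0 regularization literature, not values taken from this paper.

```python
# Minimal sketch of a Hard Concrete gate (assumed names and default constants).
import math
import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    def __init__(self, num_gates, l=-0.1, r=1.1, beta=2.0 / 3.0):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_gates))  # learnable location parameters
        self.l, self.r, self.beta = l, r, beta

    def forward(self):
        if self.training:
            # Sample u ~ U(0,1), pass it through the stretched sigmoid (inverse CDF),
            # then rectify the result into [0, 1].
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            # Deterministic gate value at evaluation time (an assumption of this sketch).
            s = torch.sigmoid(self.log_alpha / self.beta)
        s_bar = s * (self.r - self.l) + self.l
        return s_bar.clamp(0.0, 1.0)

    def expected_l0(self):
        # Closed-form probability that each gate is non-zero, Eq. (3).
        return torch.sigmoid(self.log_alpha - self.beta * math.log(-self.l / self.r)).sum()
```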

3.2 Structured pruning using factorization

A key choice is how we define parameter blocks to achieve the most effective pruning results. One obvious method is to allow each individual parameter weight to be independently pruned. While this method often retains very strong performance after pruning, it produces unstructured sparse parameter matrices which require custom hardware or sparse linear algebra primitives in order to achieve a decent computation speed-up.

Recent work has adopted structured pruning as a remedy. Consider a fully connected layer which performs a multiplication Wx for the input x ∈ R^d. One popular method adds the sparsity variables as a sparse diagonal matrix G = diag(z_1, …, z_d) to the multiplication, i.e., WGx, where d denotes the number of rows of x. This effectively removes the subset of columns of W with column index k such that z_k = 0. In practice, the structured pruning method can directly utilize the same dense linear algebra primitives (e.g. dense matrix multiplication) that are used in unpruned models. It also produces significant speedups at both training and inference time (by selecting a small subset of columns and performing multiplications with much smaller matrices). However, one limitation is that this structured pruning method tends to produce lower performance than its unstructured counterpart.
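To make the speedup argument concrete, the snippet below (illustrative, not from the paper's code) shows that once z is binary, the masked product W·G·x reduces to an ordinary dense multiplication over the surviving columns of W.

```python
# Column selection with a binary mask is equivalent to a small dense matmul.
import torch

d_out, d_in = 8, 16
W = torch.randn(d_out, d_in)
x = torch.randn(d_in)
z = (torch.rand(d_in) > 0.5).float()      # binary pruning mask over input dimensions

idx = z.nonzero(as_tuple=True)[0]         # indices of the kept columns
dense_equiv = W[:, idx] @ x[idx]          # dense multiply over the surviving columns
assert torch.allclose(dense_equiv, W @ (z * x), atol=1e-5)
```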

We propose a low-rank factorization of the weight matrix and prune rank-1 components of the factorization. That is, we reparameterize and factorize the matrix W into the product of two smaller matrices P and Q, i.e., W = PQ. Let r be the number of columns of P (or equivalently the number of rows of Q), and let p_k and q_k be the k-th column of P and the k-th row of Q respectively. Since W is the sum of r rank-1 components p_k q_k, we achieve structured pruning by introducing a pruning variable z_k for each component:

    W \;=\; P\,G\,Q \;=\; \sum_{k=1}^{r} z_k \times \big(p_k \, q_k\big)

where G = diag(z_1, …, z_r) is again the diagonal matrix of pruning variables. Intuitively, learning the factorization has the potential of keeping the most effective rank-1 components, and thereby better preserving the model performance. In addition, after training, only the columns of P and rows of Q corresponding to non-zero diagonal values need to be stored, resulting in much smaller (but still dense) matrices P and Q. The non-zero values of G can be absorbed into either P or Q. At inference time, the computation boils down to a dense matrix multiply of two smaller matrices, maximizing efficiency on current hardware. Unlike unstructured pruning, we need not store the indices of the sparse weights, resulting in better memory savings.
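A minimal sketch of such a factorized, prunable layer is shown below. It reuses the HardConcreteGate sketch from Section 3.1; the class name FactorizedPrunedLinear and the compact() helper are hypothetical, not the paper's API.

```python
# Hypothetical factorized, prunable linear layer: W is replaced by P diag(z) Q,
# and after training the surviving rank-1 components are compacted into two
# small dense matrices (z is absorbed into P).
import torch
import torch.nn as nn


class FactorizedPrunedLinear(nn.Module):
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d_out, rank) * 0.02)
        self.Q = nn.Parameter(torch.randn(rank, d_in) * 0.02)
        self.gate = HardConcreteGate(rank)   # sketch from Section 3.1

    def forward(self, x):                    # x: (batch, d_in)
        z = self.gate()                      # (rank,)
        return x @ self.Q.t() @ (self.P * z).t()

    @torch.no_grad()
    def compact(self):
        # Keep only rank-1 components with non-zero gates; absorb z into P.
        z = self.gate()
        keep = z.nonzero(as_tuple=True)[0]
        P_small = (self.P * z)[:, keep]      # (d_out, r')
        Q_small = self.Q[keep, :]            # (r', d_in)
        return P_small, Q_small
```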

3.3 Sparsity control using Augmented Lagrangian

The training objective (2) includes an L0 regularization term to promote weight pruning. One limitation of this regularization is the lack of effective control over the size of the pruned model. For instance, we observe that training runs with the same λ can converge to very different model sizes when using slightly different learning rates or pruning schedules. This can be problematic because a desired model size or parameter budget is often required in real-world applications.

We make use of an Augmented Lagrangian method to overcome this training limitation. Let t be the target model size and s(α) be the expected model size determined by the Hard Concrete parameters α. Note that s(α) can be computed based on Eq. (3) by multiplying the expected value of each z_j with the size of the j-th parameter block. The Augmented Lagrangian method imposes an equality constraint s(α) = t by introducing a violation penalty,

    g(\lambda, \alpha) \;=\; \lambda_1 \cdot \big(s(\alpha) - t\big) \;+\; \lambda_2 \cdot \big(s(\alpha) - t\big)^2

where λ_1, λ_2 ∈ R are two Lagrangian multipliers that will be jointly updated during training. The overall training optimization is an adversarial game,

    \max_{\lambda_1, \lambda_2} \; \min_{\theta, \alpha} \;\; \mathbb{E}_{u}\left[ \frac{1}{D} \sum_{i=1}^{D} \mathcal{L}\big(x_i, y_i; \tilde{\theta}\big) \right] \;+\; g(\lambda, \alpha).

The updates of λ_1 and λ_2 would always increase the training loss unless the equality constraint is met, which in our case gives us the desired model size.

Similar (and other) Lagrangian relaxation methods have been explored in other NLP problems Bastings et al. (2019); Martins et al. (2011). We adopt a quadratic penalty variant and demonstrate its effectiveness for structured pruning.
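The sketch below illustrates how the violation penalty and the adversarial multiplier updates might be wired up in PyTorch. It is a sketch under stated assumptions: the plain gradient-ascent step and the multiplier learning rate lr_lambda are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of the Augmented Lagrangian penalty and multiplier updates.
import torch

lambda1 = torch.zeros(1, requires_grad=True)
lambda2 = torch.zeros(1, requires_grad=True)
lr_lambda = 1.0  # separate, tuned learning rate for the multipliers (assumption)

def lagrangian_penalty(expected_size: torch.Tensor, target_size: float) -> torch.Tensor:
    # g(lambda, alpha) = lambda1 * (s(alpha) - t) + lambda2 * (s(alpha) - t)^2
    gap = expected_size - target_size
    return lambda1 * gap + lambda2 * gap.pow(2)

# Schematic training step (model and optimizer objects omitted):
#   loss = task_loss + lagrangian_penalty(expected_model_size, target_size)
#   loss.backward()
#   optimizer.step()                      # gradient descent on theta and alpha
#   with torch.no_grad():                 # gradient ascent on the multipliers
#       lambda1 += lr_lambda * lambda1.grad
#       lambda2 += lr_lambda * lambda2.grad
#       lambda1.grad.zero_(); lambda2.grad.zero_()
```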

3.4 Implementation details

At the start of pruning, we gradually increase the target sparsity at a linear rate. That is, given the desired final sparsity t_max, we set the target sparsity at the k-th pruning iteration as

    t_k \;=\; \min\!\left(1, \frac{k}{m}\right) \cdot t_{\max}

where m is a hyperparameter specifying the number of sparsity annealing steps.

During training, we compute the gradients with respect to θ and α, as well as to the Lagrangian multipliers λ_1 and λ_2. We perform joint gradient updates for the parameters and the Lagrangian multipliers at every iteration, but use and tune a separate learning rate for the Lagrangian multipliers. For each training batch, we sample the pruning mask z and share it across the training examples within the batch. Since the pruning mask is shared, we can select only the parameters that are active for the current batch and compute smaller matrix multiplications in the forward and backward passes. This can result in a training speedup when z becomes sparse.
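The annealing schedule reduces to a one-line helper. The function name and the example values below (t_max = 0.9 and m = 64K annealing steps, the setting discussed in Section 5) are illustrative.

```python
# Sketch of the linear sparsity-annealing schedule described above.
def target_sparsity(step: int, t_max: float, m: int) -> float:
    """Linearly anneal the target sparsity from 0 to t_max over m steps."""
    return t_max * min(1.0, step / m)

# e.g. with t_max = 0.9 and m = 64_000 annealing steps:
#   target_sparsity(16_000, 0.9, 64_000) == 0.225
#   target_sparsity(80_000, 0.9, 64_000) == 0.9
```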

4 Results

Here we comprehensively benchmark the performance of our method on language modeling and classification tasks with different neural network architectures. Since FLOP targets the weight matrix of a fully-connected (FC) layer, it in principle supports any architecture with FC layers. All training is performed using NVIDIA V100-SXM2 GPUs. All inference timing measurements are done using a single thread on an Intel Xeon E5-2686 CPU @ 2.30GHz.
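A hedged sketch of how the single-threaded CPU timing might be measured is given below; the model object, the input batch, and the helper name time_inference are placeholders, not the paper's benchmarking code.

```python
# Illustrative single-threaded CPU inference timing.
import time
import torch

torch.set_num_threads(1)              # restrict inference to a single CPU thread

def time_inference(model, batch, repeats: int = 10) -> float:
    model.eval()
    with torch.no_grad():
        model(batch)                  # warm-up pass
        start = time.perf_counter()
        for _ in range(repeats):
            model(batch)
    return (time.perf_counter() - start) / repeats
```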

Parameters     FLOP AGP (unstructured) AGP (structured) Dense Model
35M (100%) 1.24 - - -
11M (30%) 1.25 1.28 1.33 1.36
7.6M (20%) 1.27 1.30 1.36 1.40
5.9M (15%) 1.29 1.34 1.39 1.43
4.2M (10%) 1.33 1.39 1.46 1.48
(a)   SRU
Parameters FLOP AGP (unstructured) Dense Model
41M (100%) 1.10 - -
8.4M (20%) 1.16 1.17 1.24
5.3M (10%) 1.19 1.17 1.36
(b)   Transformer-XL
Table 1: Bits-per-character (BPC) at different sparsity levels for (a) the SRU model and (b) the Transformer-XL model. Lower is better. Our structured pruning approach either outperforms or matches the performance of unstructured pruning, and significantly outperforms smaller dense models trained from scratch.
Parameters Time (s) Speedup
35M (100%) 0.39 1x
11M (30%) 0.23 1.7x
7.6M (20%) 0.21 1.9x
5.9M (15%) 0.20 2.0x
4.2M (10%) 0.18 2.2x
Table 2: Inference timing measurements for the SRU model.

4.1 Character-level language modeling

Dataset

We use the enwik8 dataset, one of the standard benchmarks for character-level language modeling. The dataset contains 100M bytes of data taken from Wikipedia. Following standard practice, we use the first 90M as training data and the remaining 10M for evaluation, split evenly as the development and test sets.
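The split is simply a byte-level partition of the raw file; the sketch below is illustrative and the file path is a placeholder.

```python
# Illustrative sketch of the standard enwik8 split (90M train / 5M dev / 5M test).
with open("enwik8", "rb") as f:   # placeholder path to the raw 100M-byte file
    data = f.read()

train = data[:90_000_000]
dev = data[90_000_000:95_000_000]
test = data[95_000_000:100_000_000]
print(len(train), len(dev), len(test))  # 90000000 5000000 5000000
```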

Setup

We evaluate FLOP and all baseline methods on two recent neural network architectures, SRU Lei et al. (2018) and Transformer-XL Dai et al. (2019); Vaswani et al. (2017). We extend their implementation to support structured pruning. We re-use the training configurations and only tune the hyper-parameters of the pruning methods.

We compare against the following baseline methods:

  • Dense Model: directly trains dense (unpruned) models of smaller model sizes.

  • AGP (unstructured): one of the state-of-the-art approaches which gradually prunes parameters based on the weight magnitude Zhu and Gupta (2017).

  • AGP (structured): the original AGP method prunes individual weights. We also experiment with a structured variant similar to our method, obtained by factorizing each weight matrix as W = PGQ and controlling the sparsity of the diagonal matrix G.

We use the existing implementation provided by the public Nervana Distiller library Zmora et al. (2018) for the AGP method. We conduct ablation analyses and report the results of additional pruning variants of FLOP in Section 5.

SRU results

Following the practice of Lei et al. (2018), we train a 6-layer SRU model using a batch size of 64 and an unroll length of 256. We use a hidden size of 3056 and set the initial rank of the parameter matrices to 512; that is, we replace each weight matrix W in SRU with an explicit factorization PQ with an inner dimension of 512. We train the model without pruning for 30 epochs as a warmup, and then prune for a maximum of 100 epochs.

Parameters Time (s) Speedup
41M (100%) 1.33 1x
8.4M (20%) 0.87 1.5x
5.3M (10%) 0.82 1.6x
Table 3: Inference timing measurements for the Transformer-XL model.

Table 1 (a) presents the results of FLOP as well as the baseline methods. The results conform to our expectations and to the results reported in previous work: pruning a large model is consistently better than training a small dense model from scratch. Furthermore, FLOP exceeds the performance of the unstructured AGP method at all sparsity levels tested. For instance, using 30% of the parameters we lose only 0.01 bits-per-character (BPC), while the AGP baseline loses 0.04 BPC.

Figure 1: Inference time breakdown between different computations in the SRU (left) and Transformer-XL (right) models for different model sizes.
    Parameters    SST2    MRPC    STS-B    QNLI    Average
125M (100%) 92.43 90.9 90.22 89.77 90.83
80M (65%) 92.09 88.61 88.18 89.05 89.48
Table 4: Compression results for RoBERTa fine-tuned on downstream tasks.

FLOP can easily achieve significant computation speedup because of structured pruning. During training, FLOP obtains a training speedup ranging from 1.6x to 2.4x for the sparsity levels tested. As shown in Table 2, similar speedups are observed at inference time using CPUs: 1.7x speedup at 70% sparsity and 2.2x at 90% sparsity. In contrast, the computation of unstructured sparse matrices is harder to optimize. For models obtained using unstructured AGP, we experimented with the sparse matrix multiplication routine provided in PyTorch Paszke et al. (2017) and a recent linear algebra compiler Kjolstad et al. (2017), but were unable to achieve a speedup.

We further examine the breakdown in inference execution time in Figure 1. The computation of SRU is dominated by two operations, the matrix multiplication and the fused recurrent cell operation. As shown in the figure, the matrix multiplication is the main bottleneck before pruning, while the recurrent cell operation becomes the bottleneck after pruning. Indeed, the matrix multiplication time decreases linearly with the parameter count, highlighting the effectiveness of our structured pruning.

Transformer results

For the Transformer-XL architecture, we use the 12-layer model in Dai et al. (2019), consisting of 41M parameters in total. We introduce pruning for each of the key, query and value matrices in the self-attention layers, as well as in the feed-forward layers. For factorization-based pruning, we choose the starting rank for each weight matrix such that the total number of multiplications remains the same as in the original unfactored model (in effect, we set r = d_in · d_out / (d_in + d_out), where d_in and d_out are the dimensions of the original weight matrix). We prune the Transformer-XL model to 80% and 90% sparsity levels. Similar to the SRU model, for each pruned model we train a smaller dense model that matches (or exceeds) its parameter count, by reducing the number of layers and/or the model/inner dimensions. Again, we use unstructured AGP as an additional baseline.
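To see why this rank choice preserves the multiplication count: computing Wx costs d_in·d_out multiplications, while computing P(Qx) with rank r costs r·(d_in + d_out), and equating the two gives the rank above. A quick numeric check, with illustrative dimensions rather than the paper's exact layer sizes:

```python
# Worked check: r = d_in*d_out / (d_in + d_out) keeps the multiplication count
# of P @ (Q @ x) equal to that of W @ x (up to rounding).
d_in, d_out = 512, 2048
r = d_in * d_out // (d_in + d_out)   # = 409 (rounded down)

unfactored = d_in * d_out            # multiplications in W @ x
factored = r * (d_in + d_out)        # multiplications in P @ (Q @ x)
print(unfactored, factored)          # 1048576 vs. 1047040 (approximately equal)
```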

Table 1 (b) shows the pruning results. Again, both pruning methods significantly outperform training small dense models from scratch. Our method achieves results on par with the unstructured pruning baseline, being marginally worse at 90% sparsity but slightly better at 80% sparsity.

As shown in Table 3, our pruned Transformer-XL models achieve a 1.5-1.6x inference speedup. The relative gain is smaller than for SRU because matrix multiplication represents only around 40% of the total computation in Transformer-XL inference, with the remainder made up mostly by the softmax, layer norm and attention computations. As with SRU, we observe linear acceleration of the matrix multiplications due to pruning, but the softmax and other computations eventually dominate the inference time. The breakdown of inference time is shown in Figure 1.

Variant                       Size   0%     70%            80%            85%            90%
Input pruning (W G)           37M    1.30   1.31 (-0.8%)   1.34 (-3.2%)   1.37 (-5.4%)   1.43 (-10.0%)
Input pruning (W G), larger   66M    1.25   1.28 (-2.4%)   1.31 (-4.8%)   1.32 (-5.6%)   1.37 (-9.6%)
Factorization (P G Q), ours   35M    1.24   1.25 (-0.8%)   1.27 (-2.4%)   1.29 (-4.0%)   1.33 (-7.3%)
Table 5: Comparison between factorization-based pruning (P G Q) and input feature pruning (W G) using the 6-layer SRU model. We show the bits-per-character (BPC) at different sparsity levels and the relative loss of performance compared to the unpruned model. Our approach results in a smaller decrease in relative performance.
Figure 2: Histograms of the Hard Concrete parameters during training. We show how the histograms change for the first SRU layer (left) and the last layer (right). We compute the histogram every 3,000 training steps.

4.2 Fine-tuning BERT on classification tasks

We further demonstrate that our method can also be applied to language model fine-tuning on downstream tasks. In this experiment, we use the RoBERTa base model Liu et al. (2019), which has recently achieved state-of-the-art performance across a variety of natural language understanding tasks.

Since the model was pretrained without matrix factorization, we first compute the singular value decomposition of each matrix in the network that we aim to prune. We then introduce the pruning mask in between the resulting factored matrices. Note that this procedure temporarily increases the total number of parameters. We compare here the final number of parameters to the initial number pre-factorization.

Our results are shown in Table 4. We are able to preserve nearly 99% of the performance while reducing the number of parameters by 35%. Our target sparsity level is limited by the fact that the embedding layers make up a significant portion of the remaining parameters. We believe that higher levels of sparsity could be obtained by also factorizing the embedding layer, similar to Lan et al. (2019).
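A hedged sketch of the SVD-based initialization described above is given below: the SVD factors are rescaled so that their product reproduces the pretrained matrix exactly before the pruning mask is inserted. The variable names and the 768x768 size are illustrative.

```python
# Sketch: initialize the factorization of a pretrained weight matrix from its SVD,
# so that P @ Q reproduces W exactly before any pruning is applied.
import torch

W = torch.randn(768, 768)                    # stands in for a pretrained weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

P = U * S.sqrt()                             # (d_out, r): absorb sqrt of singular values
Q = S.sqrt().unsqueeze(1) * Vh               # (r, d_in)
assert torch.allclose(P @ Q, W, atol=1e-3)   # exact reconstruction up to float error

# A Hard Concrete gate z of size r would then be inserted between P and Q
# (W ≈ P @ diag(z) @ Q) and pruned with the same L0 objective as before.
```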

5 Analysis

In this section, we perform an analysis of several aspects of our method.

Factorization

Previous work has shown the effectiveness of L0 regularization when pruning input features, an approach we refer to as input pruning (replacing W with WG, where G is the diagonal pruning mask). We hypothesize that pruning the input dimensions is a more restrictive form of pruning, and show that our factorization strategy (replacing W with PGQ) generally yields better results. (In addition, pruning the input dimensions directly requires index-select operations on the input at inference time, based on which input dimensions are needed for the current operation; this leads to slower inference.)

To this end, we train 6-layer SRU models without weight matrix factorization and compare their performance against the factorized models. Matching the total number of parameters of the original model gives an unfactorized model with hidden size 1536 (the original model can afford a hidden size of 3056 because low-rank factorization reduces the model size). To avoid an unfair comparison, we also train an unfactorized model with a larger hidden size, containing 66M parameters in total. This model obtains 1.25 BPC, on par with the original model used in the previous experiments.

Table 5 compares the pruning performance of our factorization method and the input pruning method. We show the BPC at different sparsity levels and the relative loss of performance compared to the unpruned model. These results are consistent with our hypothesis: factorization-based pruning retains relative model performance much more effectively than input feature pruning. Our method also achieves better absolute results while using fewer parameters.

Learning dynamics

Figure 2 illustrates the training dynamics of the Hard Concrete distribution. We plot the histogram of the Hard Concrete parameters α after every few thousand training iterations. A negative value of α_j indicates that the associated parameter is likely to be pruned, while a positive value indicates the opposite; the magnitude of the value reflects the certainty of the pruning decision. As illustrated by the figure, the distribution of α becomes bi-modal after initial exploration. Certain parameters within each layer are completely pruned while others are kept with (almost) absolute certainty. In addition, the dynamics vary across layers: for SRU, the first recurrent layer is pruned more aggressively than the last layer.

Figure 3: Top: sparsity by layer for the SRU architecture at different sparsity levels. Bottom: sparsity by layer and layer type for the 41M parameter Transformer-XL architecture pruned to 90% sparsity.

Figure 4: Comparison between different numbers of sparsity annealing steps.

Sparsity at different layers

A natural question to ask is how pruning affects different parts of the network. We show in Figure 3 that layers closer to the final output tend to be pruned less aggressively. This effect is clearly visible for the SRU architecture. For the Transformer model, while a downward trend is also visible, the correlation isn't as strong, especially for the self-attention layers.

The variability in the sparsity levels of different layers hints at a strength of the L0 regularization method: the network is free to allocate different parameter budgets to different layers. This is in contrast to most other pruning approaches, where the sparsity level of each layer has to be specified Han et al. (2015); He et al. (2018). This could partly explain why our method is able to match or beat magnitude-based baselines in our experiments.

Impact of sparsity annealing

We found target sparsity annealing to be essential for good performance. Figure 4 shows the BPC for several different numbers of annealing steps. The run with the most annealing steps (64K) exhibits a much smoother sparsity growth curve and a clear improvement in BPC, thanks to the slower and smoother sparsification. This fits our intuition: a neural network should be given sufficient time to explore and adjust to an increasing sparsity.

6 Conclusion

In this work, we present a novel structured pruning method based on low-rank factorization and L0 regularization. We systematically evaluate the performance of this method on large language models. We show that our method can provide significant speedups and compression rates on large state-of-the-art models while losing minimal performance compared to unstructured magnitude pruning.

This work contributes to reducing the growing overhead of large language models, and shines a light on the role of model capacity in language modeling. In particular, we show that it is possible to build small models of very high performance through compression, which vastly outperform models of the same size trained from scratch. This suggests that the success of large language models is not only due to a higher model capacity but also to better optimization Melis et al. (2018).

References

  • J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In Advances in neural information processing systems, pp. 2654–2662. Cited by: §2.
  • J. Bastings, W. Aziz, and I. Titov (2019) Interpretable neural predictions with differentiable binary variables. arXiv preprint arXiv:1905.08160. Cited by: §1, §3.3.
  • S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, and L. Zhang (2019) Efficient and effective sparse lstm on fpga with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 63–72. Cited by: §1, §2.
  • Y. K. Chia, S. Witteveen, and M. Andrews (2019) Transformer to cnn: label-scarce distillation for efficient text classification. arXiv preprint arXiv:1909.03508. Cited by: §2.
  • Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §1, §4.1, §4.1.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • T. Gale, E. Elsen, and S. Hooker (2019) The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Cited by: §2.
  • S. Gray, A. Radford, and D. P. Kingma (2017) Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224. Cited by: §1.
  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. Cited by: §1.
  • S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2, §5.
  • Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §5.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe (2017) The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1 (OOPSLA), pp. 77. Cited by: §4.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §4.2.
  • T. Lei, Y. Zhang, S. I. Wang, H. Dai, and Y. Artzi (2018) Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4470–4481. Cited by: §4.1, §4.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.2.
  • C. Louizos, M. Welling, and D. P. Kingma (2017) Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312. Cited by: §1, §2, §3.1, §3.
  • A. F. Martins, M. A. Figueiredo, P. M. Aguiar, N. A. Smith, and E. P. Xing (2011) An augmented Lagrangian approach to constrained MAP inference. Cited by: §3.3.
  • G. Melis, C. Dyer, and P. Blunsom (2018) On the state of the art of evaluation in neural language models. In International Conference on Learning Representations, External Links: Link Cited by: §6.
  • D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2498–2507. Cited by: §2.
  • S. Narang, E. Elsen, G. Diamos, and S. Sengupta (2017a) Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119. Cited by: §1, §2.
  • S. Narang, E. Undersander, and G. Diamos (2017b) Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782. Cited by: §1, §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §4.1.
  • E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov (2019) Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5797–5808. External Links: Document Cited by: §2.
  • W. Wen, Y. He, S. Rajbhandari, M. Zhang, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li (2017) Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027. Cited by: §1, §2.
  • Z. Yao, S. Cao, W. Xiao, C. Zhang, and L. Nie (2019) Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5676–5683. Cited by: §1, §2.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §1, §1, §2, §2, 2nd item.
  • N. Zmora, G. Jacob, and G. Novik (2018) Neural network distiller. External Links: Document, Link Cited by: §4.1.