Transformer-based neural networks [vaswani2017attention] have been widely used for their strong capability to capture long-range contextual relationships. Transformer models have achieved state-of-the-art performance in various tasks on sequential data, such as language modeling (LM) [dai2019transformer, sukhbaatar2019adaptive] and language representation learning [devlin2019bert, brown2020language].
A Transformer layer consists of two sub-modules: a multihead attention module (MHA) followed by a feedforward module (FF). Both components behave similarly in that they transform representations, but the information they use is clearly distinct. MHA extracts features based on the relationship between sequential inputs, while FF transforms each feature irrespective of its relative location and value. In MHA, the connection between inputs is measured from multiple perspectives by dividing the features into several attention heads. It has been reported that each head focuses on different parts of the sequence [clark2019does]. Concurrently, it has also been shown that a considerable number of heads can be removed without performance loss [voita2019analyzing, michel2019sixteen, hao2021self].
Despite its excellent performance, the computational cost and the parameter size of the Transformer are considerably large. Attention head pruning is a promising method for reducing both. Because it is a structured pruning approach, the effect of pruning is well reflected in practical usage on modern devices, in contrast to unstructured pruning. However, the benefit only applies to MHA because FF is not affected by the number of heads, even though FF often accounts for approximately 2/3 of the parameters and half of the computations (depending on the sequence length and model configuration). To further extend the ability to compress Transformer models with attention head pruning, we adopt the recently introduced All-attention [sukhbaatar2019augmenting] Transformer, which adds persistent memory blocks inside MHA instead of FF. We denote the All-attention Transformer as All-att for simplicity.
All-att unifies the two sub-modules of the original Transformer and places almost every computation under the multihead path, which is a desirable characteristic for attention head pruning. Figure 1 demonstrates the advantage of attention head pruning on All-att compared to Transformer-XL (TXL) [dai2019transformer], which is a widely adopted model for LM. For example, at 50% head sparsity, TXL still computes approximately 73% of the full multiply-accumulate operations (MAC) and maintains 81% of the parameters, whereas All-att only requires 50% of the MAC and 50% of the parameters.
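The asymmetry can be illustrated with a rough per-layer parameter count. The sketch below is a toy accounting under an assumed configuration (Q/K/V/output plus one relative-position projection for attention, FF inner dimension 4d, and 4d persistent vectors so the layer sizes match); the exact figures depend on the model configuration.

```python
def remaining_param_fraction(sparsity, d=512, arch="txl"):
    """Rough per-layer parameter accounting under head pruning.

    Assumed toy configuration (not exact counts): Q/K/V/output plus one
    relative-position projection (5*d^2) for attention; FF inner
    dimension 4*d (8*d^2); for All-att, persistent key/value vectors of
    total size 8*d^2 so the two layer types have equal size.
    """
    keep = 1.0 - sparsity
    attention = 5 * d * d            # pruned proportionally with heads
    if arch == "txl":
        ff = 8 * d * d               # untouched by head pruning
        full = attention + ff
        pruned = attention * keep + ff
    else:  # "all-att": persistent vectors are per-head, so they shrink too
        persistent = 8 * d * d
        full = attention + persistent
        pruned = full * keep
    return pruned / full
```

Under this toy accounting, 50% head sparsity leaves roughly 81% of a TXL layer's parameters but only 50% of an All-att layer's, consistent with the figures above.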
In pruning attention heads, we utilize a trainable method so the model can jointly learn which heads can be pruned while preserving the performance. Specifically, we attach auxiliary gating parameters to each layer, inspired by earlier works [voita2019analyzing, bejnordi2019batch]. Although All-att shows comparable performance to the original Transformer in LM, removing an attention head of All-att directly entails losing the information inside the persistent memory, which replaces the role of the FF. We identify several difficulties in the pruning process: severe instability at the initial stage, a consistent increase of the training loss, overly sparse heads, and a significant performance drop of the pruned model. Therefore, we propose three techniques that modify the pruning process to solve these problems: (1) sparsity loss warm-up, (2) proper initialization, and (3) attention output scaling.
Our main contributions are summarized as follows. First, we adopt All-att to fully utilize the advantages of attention head pruning. Second, we propose advanced training techniques to minimize the damage to the performance of the pruned model and stabilize the pruning process. We demonstrate that our pruned All-att model shows consistently lower perplexity for word-level LM and lower bit-per-character for character-level LM, compared to the original Transformer model of a comparable parameter size.
2 Related Work
Pruning of Transformer models has been widely studied. Research on unstructured pruning [guo2020parameter, sanh2020movement] shows that many parameters can be removed without a significant effect on the final performance. However, with unstructured sparsity it is practically difficult to realize an actual speedup without specialized hardware support [wang2021spatten].
Several studies have focused on attention head removal, which is a structured and GPU-friendly approach. The most adopted method begins from a fully converged pretrained model and prunes out attention heads during additional training steps. For example, in [voita2019analyzing], trainable gating parameters are attached to each head and regularized with $L_0$ loss. Other types of head pruning have also been proposed. Without additional parameters, [michel2019sixteen] uses the sensitivity of each head to the loss as a proxy for its importance. A single-shot meta-pruner [zhang2021know]
is introduced in which a small convolutional neural network is trained to select heads that contribute to maintaining the attention distribution. Because earlier studies on attention head pruning do not compress FF, additional effort is needed to further reduce the computation and parameters of FF for the original Transformer.
3 Attention Head Pruning for LM
3.1 All-Attention Transformer
All-att adds a set of trainable parameters, called persistent vectors, in place of FF. These persistent vectors act as external keys and values for the Transformer but do not depend on the input. Figure 2 illustrates an All-att layer. For simplicity, we omit the relative positional encoding and its projection in the equations and the figure.
When the All-att architecture is used for LM, the memory caching algorithm of TXL is adopted. The hidden representation computed for the previous sequence segment is cached as memory and used as an external source of information. This memory mechanism enables a much longer context, which is highly beneficial for LM. Consider a sequence of $d$-dimensional vectors $X$ and a corresponding memory $M$. The query ($Q_i$), key ($K_i$), and value ($V_i$) of the $i$-th head are calculated as $Q_i = X W_i^Q$, $K_i = [M; X] W_i^K$, and $V_i = [M; X] W_i^V$, where the concatenation operator is noted as $[\cdot\,;\cdot]$. For the entire model, $H$ heads are used per layer and $L$ layers are stacked.
The persistent vectors are realized as $N$ trainable $d_h$-dimensional vectors for each head, where $d_h = d / H$ is the head dimension. $P_i^K$ and $P_i^V$ represent the persistent key and value vectors of the $i$-th head. Every query in the sequence treats $P_i^K$ and $P_i^V$ as extensions of $K_i$ and $V_i$, respectively. The output of the $i$-th head is calculated as:

$Z_i = \mathrm{Softmax}\!\left( Q_i [K_i; P_i^K]^\top / \sqrt{d_h} \right) [V_i; P_i^V]. \quad (1)$
The outputs from the multiple attention heads are concatenated and projected to produce the final result:

$O = [Z_1, \ldots, Z_H] W^O. \quad (2)$
By setting the number of persistent vectors equal to the internal dimension of FF, the number of parameters of All-att becomes almost identical to that of the original Transformer (both MHA and FF).
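The head computation described above can be sketched in NumPy as follows (a minimal sketch; function and variable names are illustrative, and relative positional terms are omitted, as in the text):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def all_att_head(x, mem, Wq, Wk, Wv, Pk, Pv):
    """One All-attention head.

    x:  (T, d) current segment; mem: (Tm, d) cached memory segment.
    Pk, Pv: (N, dh) persistent key/value vectors for this head.
    """
    q = x @ Wq                                  # queries from the input only
    ctx = np.concatenate([mem, x], axis=0)      # memory extends the context
    k = np.concatenate([ctx @ Wk, Pk], axis=0)  # persistent keys appended
    v = np.concatenate([ctx @ Wv, Pv], axis=0)  # persistent values appended
    dh = q.shape[-1]
    att = softmax(q @ k.T / np.sqrt(dh))        # (T, Tm + T + N)
    return att @ v                              # (T, dh)
```

Note that the persistent vectors enter the computation exactly like extra key/value positions, which is why removing a head removes its persistent memory as well.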
3.2 Head Pruning by Gating
For pruning, we attach a set of trainable head gating parameters $\pi = \{\pi_1, \ldots, \pi_H\}$ to each layer. The parameters pass through the BinConcrete [louizos2018learning] function and are converted to stochastic discrete Bernoulli gates $g_i \in \{0, 1\}$. The final projection is modified as follows:

$O = \frac{H}{\sum_j g_j} [g_1 Z_1, \ldots, g_H Z_H] W^O. \quad (3)$
To avoid division by zero (when all gates are sampled to 0), we clip the scaling factor to a maximum value of $H$. Because $g_i$ and the scaling factor can easily be absorbed into $W^O$, the scaling does not require additional computation at inference.
In addition to the default negative log-likelihood (nll) loss, we utilize an additional $L_0$ (sparsity) loss to explicitly encourage higher sparsity (please refer to [louizos2018learning] for the $L_0$ loss and the BinConcrete function). The overall loss is a weighted sum of both: $L = L_{nll} + \lambda L_0$. The weighting coefficient $\lambda$ controls the final sparsity. When the $i$-th head is decided to be pruned ($g_i = 0$), we remove the parameters corresponding to that head. Concurrently, their corresponding computations are removed.
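The gating path can be sketched as below. This is an illustrative NumPy sketch, not the exact BinConcrete formulation of [louizos2018learning]; the temperature `tau` and all helper names are assumptions.

```python
import numpy as np

def bin_concrete_gate(pi, tau=0.5, rng=None):
    """Sample a relaxed Bernoulli gate in (0, 1) from logits pi.

    Simplified stand-in for BinConcrete: sigmoid of logits plus
    Logistic(0, 1) noise, sharpened by a temperature tau (assumed value).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(pi))
    logistic = np.log(u) - np.log(1.0 - u)
    return 1.0 / (1.0 + np.exp(-(pi + logistic) / tau))

def gated_head_output(head_outs, g, Wo):
    """Gate each head, rescale by H / sum(g), concatenate, and project.

    head_outs: (H, T, dh); g: (H,) gate values; Wo: (H*dh, d).
    """
    H, T, _ = head_outs.shape
    scale = H / max(g.sum(), 1.0)               # clipped so scale <= H
    gated = head_outs * g[:, None, None]
    concat = np.transpose(gated, (1, 0, 2)).reshape(T, -1)
    return scale * (concat @ Wo)
```

With all gates open the scale is 1 and the projection reduces to the ungated case; heads whose gates settle at 0 contribute nothing and can be physically removed.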
3.3 Techniques for Head Pruning
Pruning begins from a converged model that was previously trained without the gating mechanism; therefore, the addition of attention head gating excessively changes the activation statistics and training dynamics. When this discrepancy is combined with the unique characteristics of All-att, we observe a significant performance drop and severe instability of the pruning process, particularly in the initial training phase. We introduce three techniques to overcome these difficulties of pruning All-att models.
First, we linearly increase the sparsity loss coefficient $\lambda$ from zero to the desired value. The gradual increase of $\lambda$ prevents the sparsity loss from overly disturbing the network while it adapts to the stochastic gate activations at the beginning of the pruning process. Note that the $L_0$ objective is a powerful pressure that can always be satisfied by decreasing the gating parameter values, which leads to a consistent increase of the sparsity.
Second, we initialize the gating parameters to a large positive value so that the sampled stochastic gates are biased to be opened ($g_i = 1$) at the beginning. Zero initialization opens a gate with only 50% probability; in that case, upper layers receive abruptly reduced information and quickly lose their existing well-trained internal structure. We initialize $\pi_i$ to 2, which corresponds to about an 88% probability of a gate being opened.
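The quoted probabilities follow from the sigmoid of the gate logit; a quick check (`gate_open_prob` is an illustrative helper name):

```python
import math

def gate_open_prob(pi):
    """Probability that a gate with logit pi is sampled open: sigmoid(pi)."""
    return 1.0 / (1.0 + math.exp(-pi))
```

Here `gate_open_prob(0)` is exactly 0.5, while `gate_open_prob(2)` is approximately 0.88, matching the figures above.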
Third, as expressed in Eq. (3), we scale the output inversely proportional to the ratio of opened gates. The scaling factor compensates for the masked portion and maintains the activation statistics after gating is applied. Recently, attention head dropout [zhou2020scheduled, zhang2021stochastic] has been introduced with similar scaling, but there the scaling is used for regularization during training. We found that this technique greatly stabilizes the training dynamics, especially in combination with the above two methods. Without output scaling, we observe a consistent increase in the training loss.
4 Experimental Results
4.1.1 Datasets and Model Architecture
We evaluate the performance on the WikiText-103 [merity2016pointer] word-level LM and Text8 [text8] character-level LM benchmarks. The pre-processing of the datasets follows common practice [dai2019transformer]. The performance is reported in perplexity (ppl) for WikiText-103 and bit-per-character (bpc) for Text8; lower is better for both.
Table 1: Sparsity (%), #Params (w/o emb.), and ppl on WikiText-103.
Table 2: Sparsity (%), #Params (w/o emb.), and bpc on Text8.
Table 3: Ablation of the proposed techniques (sparsity loss warm-up, gate initialization, output scaling; O = applied, X = removed).
- proper gate initialization (O, X, O): +0.61
- attention head output scaling (O, O, X): +1.25
- all (vanilla) (X, X, X): +1.48
The baseline model is a variant of the All-attention Transformer. The model adopts pre-norm instead of post-norm and omits the adaptive-span [sukhbaatar2019adaptive] mechanism. The configuration of the Transformer layer is as follows: the hidden dimension $d$, the number of heads $H$, and the number of persistent vectors $N$. We stack 16 layers for WikiText-103 and 12 layers for Text8.
4.1.2 Training Details
We first train the baseline model with full attention heads. We utilize the LAMB optimizer with a batch size of 96 for WikiText-103 and 64 for Text8. The sequence length and memory length are both set to 192 for WikiText-103 and 512 for Text8. We apply a linear warm-up on the learning rate for the first 4K iterations. The learning rate increases to its peak value and is then gradually decreased to its minimum by cosine learning rate scheduling. The training requires 160K iterations to converge. We use a dropout rate of 0.2 for attention matrices and 0.1 for embeddings and hidden activations.
We start pruning from the converged baseline. Pruning follows the identical training configuration except for the learning rate, which increases to a lower peak value and is gradually decreased in the same manner. The pruning requires an additional 80K iterations of training. As explained in Sec. 3.3, we warm up $\lambda$ from zero to the desired value for the first 4K iterations. After 16K iterations, we stop training the gating parameters, so that the training continues without randomness for the remaining steps. Without this fixation, we observe that the sparsity continues to increase because the influence of the $L_0$ loss becomes too large, which makes the network much more difficult to fine-tune. We explore different values of $\lambda$ to control the trade-off between sparsity and performance.
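The pruning schedule above can be sketched with two small helpers (illustrative names; the step counts are those stated in the text):

```python
def sparsity_loss_coeff(step, lam_max, warmup_steps=4000):
    """Linearly warm up the sparsity loss weight lambda over the first
    warmup_steps iterations, then hold it at lam_max."""
    return lam_max * min(step / warmup_steps, 1.0)

def gates_trainable(step, freeze_step=16000):
    """Gating parameters are updated only for the first freeze_step
    iterations; afterwards they are fixed so training is deterministic."""
    return step < freeze_step
```

For instance, at step 2000 the coefficient is half of its target value, and from step 16000 onward the gates are frozen.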
4.2 Results on Language Modeling
Tables 1 and 2 show the results of attention head pruning on two benchmarks. As expected, the number of parameters linearly decreases as sparsity increases on All-att models. We observe a clear trade-off between sparsity and performance for both datasets.
To compare with the original Transformer architecture, we train TXL models with reduced dimensions under the same configuration. Each TXL model utilizes the same number of layers and heads, whereas the hidden dimension decreases from 512 in steps of 32. Both the All-att and TXL baselines achieve almost the same perplexity and parameter size. Figure 3 shows that All-att models with attention head pruning achieve substantially better parameter efficiency than the TXL models. For example, the pruned All-att model with 43% sparsity (30.7M) achieves perplexity similar to that of the TXL with only 25% sparsity (47.9M).
We empirically show that each of the three proposed methods contributes to the improvement. Table 3 compares the effect of each technique by ablation. The most influential change is achieved by output scaling (+1.25); however, the other two also account for a portion of the improvement. The All-att model without the proposed techniques (denoted as "vanilla") is expected to suffer a level of performance degradation similar to that of TXL, which implies that the potential pruning efficiency of All-att cannot be fully utilized without our techniques.
5 Conclusion
In this paper, we introduced layer-wise attention head pruning for All-attention Transformer models and proposed three techniques to reduce the performance degradation of the pruned model and stabilize the pruning process. Experiments on language modeling demonstrate that the proposed method achieves better performance than traditional Transformer models with a comparable number of parameters.