
Low-Rank Bottleneck in Multi-head Attention Models

02/17/2020
by   Srinadh Bhojanapalli, et al.
Google
MIT

The attention-based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation. We further validate this in our experiments. As a solution, we propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling.


1 Introduction

Attention based architectures, such as Transformers, have been effective for sequence modeling tasks such as machine translation (gehring2017convolutional; vaswani2017attention), question answering, sentence classification (radford2018improving; devlin2018bert) and document generation (liu2018generating). These models have emerged as better alternatives to recurrent models such as RNNs (sutskever2014sequence), LSTMs (hochreiter1997long) and GRUs (cho2014properties). This is mainly due to their feed forward structure, which removes the sequential processing bottleneck for sequence data and makes them easier to train than recurrent models. Self attention models have also found applications in vision (wang2018non), adversarial networks (zhang2018self), reinforcement learning (zambaldi2018relational; li2017deep) and speech recognition (chiu2018state).

Recent advances in using self attention models for natural language tasks have been made by first pre-training the models on a language modeling task and then fine tuning the learned models on specific downstream tasks. radford2018improving and devlin2018bert used Transformers to pre-train a language model and showed that the fine tuned model outperforms LSTMs on many natural language understanding and question answering tasks. For example, BERT (devlin2018bert), a 24 layer Transformer model, is shown to achieve state of the art performance on several NLP tasks, including the SQuAD dataset. These advances, in addition to novel pre-training tasks, relied on bigger models with a larger embedding size. The BERT model uses an embedding size of 1024 (devlin2018bert); GPT-2 uses models with an embedding size of up to 1600 (radford2019language).

A single Transformer block consists of two key components: a multi-head self attention layer followed by a feed forward layer (vaswani2017attention). A single head in a multi-head attention layer computes self attention between the tokens in the input sequence, which it then uses to compute a weighted average of the embeddings for each token. Each head projects the data into a lower dimensional subspace and computes the self attention in this subspace. This projection size for each head is commonly referred to as the head size.

To keep the number of parameters in the attention layer fixed regardless of the number of heads, the prevalent heuristic is to scale the head size as the embedding size divided by the number of heads. This heuristic was initially proposed in vaswani2017attention and has become a de facto standard in multi-head attention models (radford2018improving; devlin2018bert). However, increasing the number of heads decreases the head size, reducing the expressive power of individual heads. We prove that reducing the head size to a value below the input sequence length harms the representation power of each head (see Theorem 1). This is because a smaller head size introduces a rank constraint on the projection matrices in each head and limits their representation power. We indeed notice this effect in practice: while the performance improves with an increasing number of heads in the beginning (devlin2018bert), we notice a drop in performance once the number of heads increases beyond a certain threshold, as seen in Table 1 and Fig. 1 (see also Table 4(A) in vaswani2017attention).
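To make the arithmetic behind this heuristic concrete, here is a small Python sketch with illustrative values (an embedding size of 1024 and a sequence length of 128, chosen only because they match the BERT-style setting discussed later, not tied to any specific experiment). It shows that the per-layer query/key/value parameter count stays constant while the head size drops below the sequence length once the number of heads exceeds the ratio of embedding size to sequence length:

```python
d, n = 1024, 128            # embedding size and sequence length (illustrative values)
for h in (8, 16, 32, 64):
    head_size = d // h      # prevalent heuristic: head size = d / h
    qkv_params = 3 * h * head_size * d   # query/key/value parameters per layer
    print(f"h={h:2d}  head_size={head_size:4d}  qkv_params={qkv_params:,}  "
          f"head_size < n: {head_size < n}")
# h= 8  head_size= 128  qkv_params=3,145,728  head_size < n: False
# h=16  head_size=  64  qkv_params=3,145,728  head_size < n: True
# h=32  head_size=  32  qkv_params=3,145,728  head_size < n: True
# h=64  head_size=  16  qkv_params=3,145,728  head_size < n: True
```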

In order to avoid hurting the performance, the existing models allow for multiple heads by increasing the embedding size, which in turn increases the head size. However, a larger embedding size, in addition to increasing the number of parameters, makes it expensive to use the model and the learned embeddings in downstream tasks, as the downstream model sizes scale with the embedding size of the tokens. For example, the inference time and memory required in retrieval tasks typically increase linearly with the embedding size.

# heads       8            16           32
# params      336M         336M         336M
SQuAD - F1    90.89±0.15   90.61±0.14   90.45±0.08
SQuAD - EM    84.1±0.34    83.75±0.27   83.48±0.13
MNLI          85±0.2       84.5±0.4     84.4±0.2
Table 1: The performance of BERT-large (devlin2018bert), a 24 layer Transformer with an embedding size of 1024, suffers with an increasing number of heads beyond 8 heads.

In this paper we propose setting the head size of attention units to the input sequence length. While this is a simple hyper-parameter change in the Transformer architecture, we show that it is important to set this value appropriately to avoid the low-rank bottleneck (see Theorem 1) and to improve the representation power (see Theorem 2). This fixed head size is independent of both the number of heads and the embedding size of the model. This allows us to train models with a relatively smaller embedding size (and hence fewer parameters) without affecting the head size. Another advantage of the fixed head size is that, unlike the standard setting, which requires the number of heads to be a factor of the embedding size, we are free to set an arbitrary number of heads as required for the task.

Interestingly, we note that this simple yet novel approach of fixing the head size in multi-head Transformers results in empirically superior performance. We evaluate Transformers trained with this fixed head size on language modeling (LM1B dataset), natural language inference (MNLI dataset) and question answering (SQuAD dataset) tasks. We show that fixing the head size allows us to train Transformers with better performance scaling and a smaller embedding size. We show that, with a fixed head size, Transformers trained with an embedding size of 512 can match the performance of BERT-large (devlin2018bert), a Transformer with an embedding size of 1024 (see Fig. 2). We further present experimental results evaluating the effect of different choices of the head size and the embedding size in Section 4.

Our contributions in this paper lie in identifying and rigorously proving the low-rank bottleneck in multi-head attention models, and in showing that fixing the head size to the input sequence length results in a strictly better model, both theoretically and empirically. The contributions of this paper are summarized below.

  • We analyze the representation power of the multi-head self attention layer and prove the low-rank bottleneck the head size places on the attention units (Theorem 1).

  • We propose to set the head size to the input sequence length, and show that fixing the head size strictly improves the expressive power of the multi-head attention layers compared to the standard heuristic for setting the head size (Theorem 2). This allows us to both increase the number of heads per layer and decrease the embedding size, without hurting the performance. We develop a novel construction based approach to prove this result, which can potentially be useful in analyzing other variants of the Transformer architecture.

  • We experimentally show that with a fixed head size, Transformers can be trained with better performance scaling and a smaller embedding size on three standard NLP tasks.

1.1 Related Works

Given the significance of self attention models, there has been work on both improving the performance and speeding up the computation in Transformers. ott2018scaling and you2019reducing reduce precision and use large batch training to reduce the training time of attention models. child2019generating propose sparse self attention models to speed up the computation in the attention layer for long sequence data generation tasks. They show that these sparse attention models can be trained on tasks with sequence lengths greater than 10k without sacrificing accuracy. dehghani2018universal propose a depth recurrent Transformer network that reuses the parameters across layers. They show that this modification makes Transformer networks Turing complete even with finite precision weights.

yang2019xlnet propose a new way to increase the effective sequence length that the Transformer attends to, by reusing the intermediate embeddings across sequences. They show that the modified architecture performs better on tasks that require computing context over longer sequence lengths. We note that most of these modifications rely on the multi-head self attention layer, the same building block of the Transformer. Our work studies this basic multi-head attention layer and suggests a new way to set the head size, which can easily be applied along with any of the above architectural modifications.

wu2019pay propose to replace the self-attention layer with lightweight dynamic convolutions and show improved performance on machine translation and language modeling. Even though the resulting model has faster inference time, it still needs to use a large embedding size (1024), as big as the original attention models. We believe the techniques in this paper can be combined with these results to realize both smaller embedding size and faster inference time.

sun2019token perform neural architecture search using evolutionary methods on sequence to sequence models and find an evolved Transformer architecture, which, in addition to multi-head attention units, has convolutional filters and gated linear units. Our proposed modifications stay closer to Transformers in spirit and can be used as seed units for this architecture search.

yang2017breaking have studied the effect of the rank constraint caused by small projection sizes in computing the softmax loss. The situation in self attention layers is a bit different. While the expressive power of each head reduces with decreasing head size, at the same time we are increasing the number of heads, which could potentially negate this effect and increase the overall capacity of the layer. As we show in Theorem 2, the prevalent head size heuristic indeed limits the expressive power of the multi-head attention layer.

yun2019transformers studied the representation power of Transformers and showed that they are universal approximators of sequence to sequence functions. However, they do not study the low-rank bottleneck caused by the prevalent head size heuristic and its connection to the embedding size.

voita2019analyzing; michel2019sixteen study the importance of different heads in an attention layer. They observe that, during inference, many of the heads in each layer can be pruned away with little effect on the prediction. However, they still need multiple heads during training.

child2019generating; correia2019adaptively impose sparsity structure on the attention layer during training to improve both interpretability and performance. Fixing the head size will in fact make it easier to learn such sparsity patterns, as a low rank constraint does not allow a head to express all possible sparsity patterns. Combining these techniques can hence potentially enable training of sparse attention models with a smaller embedding size.

2 Transformer Architecture and Analysis

In this section, we present the Transformer architecture and analyze the representation power of the multi-head self attention, a key component of the Transformer block.

The input to a Transformer network is a sequence of $n$ tokens. Typically, each token is converted into a token embedding of dimension $d$ by an embedding layer. We let $X \in \mathbb{R}^{d \times n}$ be the embedding matrix corresponding to the $n$ tokens in the input sequence.

2.1 Single-Head Attention

The Transformer block is a combination of a self attention layer followed by a feed forward layer (vaswani2017attention). Both layers have a skip connection and use Layer Normalization (LN) (ba2016layer). In particular, for token embeddings $X \in \mathbb{R}^{d \times n}$, the dot product attention is computed as follows:

$\text{Attention}(X) = W_v X \cdot \text{Softmax}[(W_k X)^{\top} (W_q X)].$   (1)

Here $W_q \in \mathbb{R}^{d_p \times d}$, $W_k \in \mathbb{R}^{d_p \times d}$ and $W_v \in \mathbb{R}^{d_p \times d}$ represent the projection matrices associated with the query, key and value respectively in an attention unit (vaswani2017attention), and $d_p$ denotes the head (projection) size. For a single-head attention unit, we have $d_p = d$. In the dot-product attention (cf. (1)), the columnwise Softmax term $\text{Softmax}[(W_k X)^{\top} (W_q X)]$ aims to capture the context of the input for a given token based on the remaining tokens in the input sequence. Subsequently, the output of the attention layer takes the following form:

$\text{AttentionLayer}(X) = \text{LN}(X + \text{Attention}(X)),$   (2)

where $\text{LN}(\cdot)$ represents the layer-normalization operation. Given the attention module, as defined in (1), it is natural to question its ability to represent arbitrary contexts for a given input sequence $X$.
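The following NumPy sketch mirrors the reconstructed equations (1) and (2) for a single head. The helper names, the simplified layer normalization without learned scale and offset, and the omission of any Softmax temperature are our own simplifications, not the authors' implementation:

```python
import numpy as np

def softmax_columns(logits):
    """Column-wise Softmax: each column of the result is a probability vector."""
    z = logits - logits.max(axis=0, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Dot-product attention of (1): X is (d, n); W_q, W_k, W_v are (d_p, d)."""
    P = softmax_columns((W_k @ X).T @ (W_q @ X))     # (n, n) context matrix
    return (W_v @ X) @ P                             # (d_p, n)

def layer_norm(Z, eps=1e-6):
    """Simplified LN over the embedding dimension (no learned scale/offset)."""
    return (Z - Z.mean(axis=0, keepdims=True)) / (Z.std(axis=0, keepdims=True) + eps)

def attention_layer(X, W_q, W_k, W_v):
    """Single-head attention layer of (2): skip connection followed by LN (d_p = d)."""
    return layer_norm(X + attention(X, W_q, W_k, W_v))

d, n = 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
print(attention_layer(X, W_q, W_k, W_v).shape)   # (8, 4)
```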

In the following result we establish that, for a large enough projection size $d_p$, an attention unit can represent any data pair $(X, P)$. We also show that the model cannot represent an arbitrary context $P$ when $d_p$ is smaller than $n$, creating a low-rank bottleneck.

Theorem 1 (Representation Theorem).

If $d_p \geq n$, then given any full column rank matrix $X \in \mathbb{R}^{d \times n}$ and an arbitrary $n \times n$ positive column stochastic matrix $P$, there always exist $d_p \times d$ projection matrices $W_q$ and $W_k$ such that

$\text{Softmax}[(W_k X)^{\top} (W_q X)] = P.$   (3)

If $d_p < n$, there exist $X$ and $P$ such that (3) does not hold for all $W_q$ and $W_k$.

This result shows that the projection dimension $d_p$ needs to be at least the sequence length $n$ for the attention unit to be able to represent any desired context $P$. Even though this result describes the case of a single example sequence, it highlights a fundamental property of the model architecture: decreasing the projection size below a certain threshold introduces a bottleneck.

Proof of Theorem 1.

$d_p \geq n$ case. To prove the first part of the result, we present an explicit construction of $W_k$ and $W_q$ which allows us to generate $P$ from $X$ using the dot product attention. Since $X$ has full column rank, there exists a left inverse $X^{\dagger} \in \mathbb{R}^{n \times d}$ such that $X^{\dagger} X = I_n$. Let $W_k = [X^{\dagger}; 0]$ and $W_q = [\tilde{G} X^{\dagger}; 0]$, where $\tilde{G} \in \mathbb{R}^{n \times n}$ is to be chosen later and the zero padding makes both matrices $d_p \times d$. Then

$(W_k X)^{\top} (W_q X) = (X^{\dagger} X)^{\top} \tilde{G} (X^{\dagger} X) = \tilde{G}.$   (4)

Now that the above choice of $W_k$ and $W_q$ has handled the dependence on $X$, we will choose a $\tilde{G}$ depending on $P$ and finish the construction. Below we express the Softmax operation on the query and key inner products. Note that the Softmax here is a columnwise operator computing the attention scores for each query. By using (4), we obtain that

$\text{Softmax}[(W_k X)^{\top} (W_q X)] = \text{Softmax}[\tilde{G}] = \exp(\tilde{G}) \cdot D_{\tilde{G}}^{-1},$

where $D_{\tilde{G}}$ is an $n \times n$ diagonal matrix such that

$(D_{\tilde{G}})_{jj} = \sum_{i=1}^{n} \exp(\tilde{G}_{ij}).$

Hence, we can establish the desired result by showing that there always exists a $\tilde{G}$ that satisfies the following fixed point equation.

$\exp(\tilde{G}) \cdot D_{\tilde{G}}^{-1} = P.$   (5)

Given $P$, to construct such a $\tilde{G}$, we pick an arbitrary positive diagonal matrix $D$, and set

$\tilde{G} = \ln(P \cdot D).$   (6)

Since $P$ is a positive matrix, such a $\tilde{G}$ always exists. Next, we verify that this construction indeed satisfies the fixed point equation (cf. (5)). Note that

$(D_{\tilde{G}})_{jj} = \sum_{i=1}^{n} \exp(\tilde{G}_{ij}) = \sum_{i=1}^{n} P_{ij} D_{jj} = D_{jj}.$   (7)

The last equation follows from the fact that $P$ is a column stochastic matrix. Now, using (6) and (7),

$\text{Softmax}[(W_k X)^{\top} (W_q X)] = \exp(\tilde{G}) \cdot D_{\tilde{G}}^{-1} = P \cdot D \cdot D^{-1} = P.$

This completes the first part of the proof.
 
$d_p < n$ case. When $d_p < n$, the logit matrix $(W_k X)^{\top} (W_q X)$ has rank at most $d_p < n$, which restricts the contexts that the columnwise Softmax can produce. Consider the case of and . Then and and . Let . Then

This matrix clearly cannot be used to generate context matrices $P$ that have distinct elements in the second column, e.g., . ∎
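The construction in the first part of the proof is easy to check numerically. The sketch below assumes the reconstruction above (in particular, $\tilde{G} = \ln(PD)$ as in (6) and a columnwise Softmax without temperature); the variable names and the toy dimensions are ours:

```python
import numpy as np

def softmax_columns(logits):
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
d, n, d_p = 6, 4, 5                      # requires d_p >= n
X = rng.normal(size=(d, n))              # full column rank with probability 1

# Target context: an arbitrary positive column-stochastic matrix P.
P = rng.uniform(0.1, 1.0, size=(n, n))
P /= P.sum(axis=0, keepdims=True)

X_pinv = np.linalg.pinv(X)               # left inverse: X_pinv @ X == I_n
D = np.diag(rng.uniform(0.5, 2.0, size=n))   # arbitrary positive diagonal matrix
G = np.log(P @ D)                        # G = ln(P D), cf. (6)

pad = np.zeros((d_p - n, d))             # zero-pad the projections up to d_p rows
W_k = np.vstack([X_pinv, pad])
W_q = np.vstack([G @ X_pinv, pad])

context = softmax_columns((W_k @ X).T @ (W_q @ X))
print(np.allclose(context, P))           # True
```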

2.2 Multi-Head Attention

As discussed in Section 2.1, an attention unit updates the embedding of an input token based on a weighted average of the embeddings of all the tokens in the sequence, using the context (cf. (1)). vaswani2017attention proposed the Multi-Head attention mechanism, which increases the representation power of an attention layer by letting multiple attention units operate on different low dimensional projections of the input, with each attention unit referred to as a head. This is followed by a concatenation of the outputs from the different heads. In particular, the computation inside a Multi-Head attention layer with $h$ heads takes the following form:

$\text{head}_i(X) = W_v^i X \cdot \text{Softmax}[(W_k^i X)^{\top} (W_q^i X)], \quad i = 1, \ldots, h.$

The output of the Multi-Head attention layer then becomes

$\text{MultiHead}(X) = W_o \cdot \text{Concat}[\text{head}_1(X), \ldots, \text{head}_h(X)],$   (8)

where $W_o \in \mathbb{R}^{d \times d}$. For a model with $h$ heads, the query, key and value projection matrices $W_q^i$, $W_k^i$ and $W_v^i$ are $(d/h) \times d$ matrices. Therefore, each head projects the input onto a $(d/h)$-dimensional subspace to compute the context, which keeps the number of parameters per layer fixed. Using MultiHead has resulted in empirically better performance over the single head attention layer (vaswani2017attention).
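For concreteness, here is a minimal NumPy sketch of the MultiHead computation in (8) with the standard $d/h$ head size; the function names and random inputs are illustrative assumptions rather than the authors' code, and the skip connection and layer normalization of the surrounding Transformer block are omitted:

```python
import numpy as np

def softmax_columns(logits):
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def multi_head(X, heads, W_o):
    """Standard MultiHead: each head i uses (d/h x d) projections W_q, W_k, W_v.

    `heads` is a list of (W_q, W_k, W_v) tuples; the head outputs are concatenated
    along the projection dimension and mixed by the (d x d) output matrix W_o.
    """
    outputs = []
    for W_q, W_k, W_v in heads:
        P = softmax_columns((W_k @ X).T @ (W_q @ X))   # (n, n) context per head
        outputs.append((W_v @ X) @ P)                  # (d/h, n)
    return W_o @ np.concatenate(outputs, axis=0)       # (d, n)

d, n, h = 16, 8, 4
head_size = d // h                                     # prevalent heuristic
rng = np.random.default_rng(2)
X = rng.normal(size=(d, n))
heads = [tuple(0.1 * rng.normal(size=(head_size, d)) for _ in range(3))
         for _ in range(h)]
W_o = 0.1 * rng.normal(size=(d, d))
print(multi_head(X, heads, W_o).shape)                 # (16, 8)
```

Note that the concatenated head outputs always have dimension $h \cdot (d/h) = d$, which is what keeps the per-layer parameter count independent of $h$ under this heuristic.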

2.3 Low-Rank Bottleneck

While increasing the number of heads seemingly gives the model more expressive power, at the same time we are reducing the head size, which can decrease the expressive power. When the number of heads $h$ is larger than $d/n$, the attention unit inside each head projects onto a dimension $d/h$ smaller than $n$, creating a low-rank bottleneck, and loses its ability to represent arbitrary context vectors (cf. Theorem 1). Interestingly, this is consistent with the empirical observation in Table 1 that increasing the number of heads beyond 8 results in performance degradation in BERT-large (devlin2018bert); note that $d = 1024$ and $n = 128$ for most of the pre-training phase of BERT-large.

Since the sequence length $n$ is fixed by the data/task at hand, the only way to increase the number of heads without introducing the low-rank bottleneck is to increase the embedding size $d$. This is a fundamental limitation of the currently dominant head size heuristic: to support more heads, we need to increase the embedding size.
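A quick numerical illustration of this bottleneck (a toy sketch with random matrices, not an experiment from the paper): the $n \times n$ logit matrix entering the columnwise Softmax has rank at most $d/h$, so it becomes rank-deficient exactly when $h > d/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 64, 32                      # embedding size and sequence length (toy values)
X = rng.normal(size=(d, n))

for h in (1, 2, 4, 8):
    head_size = d // h             # d/h under the standard heuristic
    W_q = rng.normal(size=(head_size, d))
    W_k = rng.normal(size=(head_size, d))
    logits = (W_k @ X).T @ (W_q @ X)          # (n, n) pre-Softmax matrix
    rank = np.linalg.matrix_rank(logits)
    print(f"h={h}  head_size={head_size:2d}  rank(logits)={rank:2d}  (n={n})")
# The printed rank equals min(head_size, n): full rank for h <= d/n = 2,
# rank-deficient for h = 4 and h = 8.
```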

Unfortunately, increasing the embedding size leads to higher computation and memory requirements to train and store the model. Further, since it is common to use learned embeddings from Transformer based models for downstream tasks (devlin2018bert), larger embedding size increases the model size and computation required for all the downstream tasks as well.

(a) LM1B  (b) LM1B
Figure 1: Performance of Transformers trained with the prevalent head size heuristic ($d_p = d/h$) (baseline) compared with the fixed head size ($d_p = 32$) models on a language modeling task (LM1B) on the test set. We train the baseline models with embedding sizes from 256 to 512. We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, and vary the number of heads from 4 to 70, while matching the number of parameters. The plots clearly indicate that fixing the head size allows us to train Transformers with a smaller embedding size (plot (b)) and with a better scaling of performance (plot (a)). Note that for perplexity lower values are better.

3 Fixed Multi-Head Attention

In this section we propose to fix the head size of the Transformer, which allows us to enjoy the advantage of the higher expressive power of multiple heads without requiring the embedding size to be large. The key is to decouple the dependency between the projection size in a head and the embedding size of the model. The projection matrices now project onto subspaces of a fixed dimension $d_p$, irrespective of the number of heads $h$. This approach, where $d_p$ is independent of $d$ and $h$, leads to the following attention mechanism:

$\text{FixedAttention}_i(X) = W_v^i X \cdot \text{Softmax}[(W_k^i X)^{\top} (W_q^i X)], \quad i = 1, \ldots, h.$

Note that the projection matrices used here, $W_q^i$, $W_k^i$ and $W_v^i$, are $d_p \times d$ matrices. With $W_o = [W_o^1, \ldots, W_o^h] \in \mathbb{R}^{d \times h d_p}$, where each $W_o^i \in \mathbb{R}^{d \times d_p}$, the output of this new multi-head attention layer takes the following form:

$\text{FixedMultiHead}(X) = W_o \cdot \text{Concat}[\text{FixedAttention}_1(X), \ldots, \text{FixedAttention}_h(X)] = \sum_{i=1}^{h} W_o^i \cdot \text{FixedAttention}_i(X).$

This modification makes each attention head more similar to a hidden unit in a feed forward network or a filter in a convolutional network, and allows us to vary the number of heads without worrying about reducing the representation power per head. The downside is, unlike the standard MultiHead, the number of parameters per layer increases with the number of heads. However, this modification allows us to train a model with a smaller embedding size without a low-rank bottleneck, ultimately allowing us to reduce the total number of parameters in the model.
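A sketch of the fixed head size layer, paralleling the earlier MultiHead snippet; the only changes are the $d_p \times d$ projection shapes and the wider $d \times (h \cdot d_p)$ output matrix. As before, the function names and inputs are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def softmax_columns(logits):
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def fixed_multi_head(X, heads, W_o):
    """FixedMultiHead: every head uses (d_p x d) projections, d_p independent of h.

    `heads` is a list of (W_q, W_k, W_v) tuples with d_p rows each; W_o has shape
    (d, h * d_p), so the parameter count now grows with the number of heads.
    """
    outputs = [(W_v @ X) @ softmax_columns((W_k @ X).T @ (W_q @ X))
               for W_q, W_k, W_v in heads]
    return W_o @ np.concatenate(outputs, axis=0)       # (d, n)

d, n, h = 16, 8, 12          # note: h no longer needs to divide d
d_p = n                      # fixed head size, set to the sequence length
rng = np.random.default_rng(4)
X = rng.normal(size=(d, n))
heads = [tuple(0.1 * rng.normal(size=(d_p, d)) for _ in range(3)) for _ in range(h)]
W_o = 0.1 * rng.normal(size=(d, h * d_p))
print(fixed_multi_head(X, heads, W_o).shape)           # (16, 8)
```

With $d_p$ fixed, the per-layer attention parameter count is roughly $4 h d_p d$ and grows linearly with $h$; the compensating saving, as argued above, comes from being able to shrink the embedding size $d$.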

3.1 MultiHead vs. FixedMultiHead Attention

Given a MultiHead layer, we can always represent it using a FixedMultiHead layer whenever the head size satisfies $d_p \geq d/h$. While this shows that increasing the number of heads beyond $d/d_p$ makes individual heads of the FixedMultiHead at least as expressive as the ones in the MultiHead, it is not obvious if the FixedMultiHead is strictly more expressive. Can the FixedMultiHead layer represent functions that the standard MultiHead layer cannot? In this subsection we show that, indeed, in the multi-head regime the FixedMultiHead layer is strictly better than the standard MultiHead layer in terms of expressive power.
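The containment direction noted at the start of this subsection can also be checked constructively: when $d_p \geq d/h$, zero-padding the $(d/h) \times d$ projections of a MultiHead head yields a FixedMultiHead head that computes the same output, up to trailing zero rows. A hedged sketch under the same assumptions as the earlier snippets:

```python
import numpy as np

def pad_rows(W, d_p):
    """Zero-pad a (d/h x d) projection matrix to d_p rows (requires d_p >= d/h)."""
    return np.vstack([W, np.zeros((d_p - W.shape[0], W.shape[1]))])

def context(W_q, W_k, X):
    """Column-wise Softmax of the (n x n) attention logits."""
    e = np.exp((W_k @ X).T @ (W_q @ X))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(5)
d, n, h = 16, 8, 4
d_p = n                                    # fixed head size, here >= d/h
X = rng.normal(size=(d, n))
W_q, W_k, W_v = (0.1 * rng.normal(size=(d // h, d)) for _ in range(3))

small = (W_v @ X) @ context(W_q, W_k, X)                 # MultiHead head output
Wq_p, Wk_p, Wv_p = (pad_rows(W, d_p) for W in (W_q, W_k, W_v))
big = (Wv_p @ X) @ context(Wq_p, Wk_p, X)                # FixedMultiHead head output
print(np.allclose(big[: d // h], small), np.allclose(big[d // h:], 0))  # True True
```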

Consider the standard multi-head attention units in (8). We denote the collection of all parameter matrices as $\Theta$ and write $f(\cdot\,; \Theta)$ for the corresponding function. Similarly, consider the function $g(\cdot\,; \tilde{\Theta})$ represented by the fixed head size attention units, and let $\tilde{\Theta}$ be the collection of all these parameter matrices. We define $\mathcal{F}$ and $\tilde{\mathcal{F}}$ to be the classes of functions $f(\cdot\,; \Theta)$ and $g(\cdot\,; \tilde{\Theta})$, respectively. As noted above, if $d_p \geq d/h$, we have $\mathcal{F} \subseteq \tilde{\mathcal{F}}$.

The following theorem shows that even for simple examples in $\tilde{\mathcal{F}}$, functions in $\mathcal{F}$ fail to represent them; this already shows that $\mathcal{F}$ is a strict subset of $\tilde{\mathcal{F}}$.

Theorem 2.

Let , , and . Consider a FixedMultiHead attention layer with parameters that satisfy the following conditions:

Then, for any $\Theta$, there exists an $X$ such that $f(X; \Theta) \neq g(X; \tilde{\Theta})$.

Because the difference $f(\cdot\,; \Theta) - g(\cdot\,; \tilde{\Theta})$ is a continuous function of $X$, the existence of such an $X$ implies that the integral of the norm of the difference (i.e., the approximation error) is strictly positive. We note that the assumptions in the above theorem are made to provide a simple and constructive proof; in fact, the failure of MultiHead (with head size $d/h$) to represent even such simple attention layers suggests that the situation is likely worse for more complex functions.

Theorem 2 shows that the expressive power of the FixedMultiHead attention function class is strictly greater than that of the standard MultiHead attention function class. Hence the heuristic of reducing the head size with the number of heads limits the expressive power of MultiHead, whereas using a fixed head size increases the expressive power of the attention layers.

(a) SQuAD F1
(b) SQuAD EM
(c) MNLI
Figure 2: Comparison of 24 layer Transformer models trained with the prevalent head size heuristic (baseline) vs. the fixed head size model on the SQuAD and MNLI dev sets. We vary the embedding size of the baseline models from 512 to 1024. We train the fixed head size models with a fixed embedding size of 512 and a head size of 128, with a varying number of heads from 8 to 32, while matching the number of parameters. Fixing the head size allows us to train models with a smaller embedding size of 512 and with a better performance.
Figure 3: Ablation studies on LM1B: (a) We fix the embedding size of all the models to 256 and vary the capacity of Transformers trained with the prevalent head size heuristic (baseline) by increasing the size of the feedforward layers. For the fixed head size models we fix the head size to 32, so the 8 head fixed head size model is the same as the 8 head baseline model. We notice that, again, with the standard heuristic increasing the number of heads beyond 16 hurts the performance, whereas with a fixed head size increasing the number of heads monotonically improves the performance. (b) We show the effect of the head size on the performance with different numbers of heads. Both plots clearly show the advantage of having an additional way to tune the capacity of Transformers with a fixed embedding size.

4 Experiments

The goal of this section is to show that setting the head size in a principled way leads to better performance than using the prevalent heuristic. We again note that while this is a simple hyper-parameter change to the Transformer, setting it to the input sequence length, as suggested by our analysis, allows us to train better models with a smaller embedding size.

In this section we present our experiments on three standard NLP tasks, language modeling (LM1B), question answering (SQuAD), and sentence entailment (MNLI), to demonstrate: 1) with the prevalent head size heuristic, increasing the number of heads in Transformers beyond a certain point hurts the performance, whereas with fixed head size attention layers it always helps; 2) decoupling the head size from the embedding size allows us to train models with a smaller embedding size; and 3) setting the head size appropriately in Transformers allows us to train models with better performance scaling. We first describe our experimental setup, followed by our results and ablation studies on the proposed modifications.

4.1 Setup and Datasets

For the language modeling task we use the one billion word benchmark dataset (LM1B) (lm1b). This dataset has around 30M training examples and around 300k examples in the test set. We use a sub-word tokenizer with a 32k vocabulary and cap the input sequence length at 256. We train a 6 layer Transformer model with the ADAM optimizer using the tensor2tensor library (tensor2tensor). The detailed experimental settings are presented in Appendix C.

Multi-Genre Natural Language Inference (MNLI) is a sentence level entailment task, designed to test natural language understanding (MNLI). Given a premise sentence and a hypothesis sentence, the goal is to predict whether hypothesis entails, contradicts or is neutral to the premise. We report the classification accuracy for this task. Stanford Question Answering Dataset (SQuAD) is a question answering dataset, where given a paragraph and a question, the goal is to predict the sequence of words in the paragraph that constitute the answer to the question (rajpurkar2016squad). This is a harder word level task, compared to the sentence classification task. We report both Exact Match (EM) and F1 scores for this task. All results in this section are reported on the Dev set, which has not been used in any experimental choices in this paper.

For these latter two tasks, we follow the two stage approach of first pre-training on a language modeling task and then fine-tuning the models on the task data. We follow the same experimental setup for both pre-training and fine-tuning as BERT (devlin2018bert), and use their codebase (https://github.com/google-research/bert). We first pre-train our models using the masked language model and the next sentence prediction objectives, and then fine tune the pre-trained model for individual tasks (devlin2018bert). For pre-training we use the English Wikipedia and BooksCorpus datasets (zhu2015aligning). The input to the models is tokenized using the WordPiece representation with 30000 tokens in the vocabulary. We present the key experimental choices in Appendix C, and refer the reader to devlin2018bert for a complete description of the setup.

# heads       8            12           16           32
# params      168M         193M         218M         319M
SQuAD - F1    89.6±0.17    90.25±0.21   90.43±0.14   90.95±0.14
SQuAD - EM    82.73±0.21   83.18±0.24   83.59±0.06   84.4±0.29
MNLI          83.5±0.2     84.2±0.2     83.9±0.2     84.9±0.2
(A) Increasing the number of heads

head size     32           64           128          256
# params      130M         142M         168M         218M
SQuAD - F1    88.53±0.06   89.51±0.15   89.6±0.17    90.33±0.23
SQuAD - EM    81.19±0.21   82.41±0.32   82.73±0.21   83.36±0.48
MNLI          82.5±0.1     83.4±0.3     83.5±0.2     83.9±0.2
(B) Increasing the head size

Table 2: Ablation studies on SQuAD and MNLI: (A) A 24 layer Transformer with a fixed head size of 128 and an embedding size of 512 shows an improvement in accuracy with an increasing number of heads. (B) The fixed head size model with an embedding size of 512 and 8 heads shows an improvement in accuracy with an increasing head size. This shows that the head size is indeed an important capacity-controlling parameter in the self attention architecture.

Choice of the head size. Our proposed modification introduces the head size as a new model hyper-parameter. We choose the head size to be 128 for our BERT experiments, as most of the pre-training is done with 128 sequence length data. While our ablation studies (cf. Table 2(B)) show that a bigger head size improves the performance, there is a tradeoff between increasing the head size, the number of heads, and the number of layers. We found that having a sufficiently large head size, e.g., matching the pre-training sequence length, is better than having a larger embedding size.

4.2 Results

For our first set of experiments, we want to see if Transformers trained with a fixed head size and a smaller embedding size can match the performance of training with the standard head size heuristic but with a larger embedding size. As a baseline for the language modeling task, we train Transformers with the embedding size increasing from 256 to 512 with different numbers of heads. We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, with an increasing number of heads from 4 to 70. We notice that Transformers with a fixed head size and an embedding size of 256 have better performance than the baseline models with an embedding size of 512 (see Fig. 1). We repeat a similar experiment on the other two tasks, where for the baseline we train BERT-large, a 24 layer, 16 head Transformer with the standard head size heuristic, with embedding sizes from 512 to 1024. We compare it with the fixed head size model, with an embedding size of 512 and a head size of 128, with an increasing number of heads from 8 to 32. We again notice that the Transformers trained with a fixed head size and a 512 embedding size have better performance than the baseline (see Fig. 2).

Note that simply trying to increase the head size of the Transformers by decreasing the number of heads does not improve the performance, as decreasing the number of heads reduces the expressive power of the model (see Fig. 4 in the Appendix). Hence, both the head size and the number of heads need to be set high enough for better performance.

4.3 Ablation

Increasing heads. From Table 1 and Fig. 1(a) we can see that increasing the number of heads hurts the performance of the Transformer after a certain number. We repeat the same experiments with the fixed head size Transformer, and present the results in Table 2(A) and Fig. 3(a). The results show that the performance of the modified model improves monotonically as the number of heads increases. This is because the model capacity (a function of the head size) is no longer reduced with the increasing number of heads.

Increasing head size. In Table 2(B) and Fig. 3(b), we present comparisons between models with different head sizes. This shows that the gains in the performance of the fixed head size models indeed come from adjusting the head size of the query, key and value layers in the attention unit. The table shows a clear trend of better performance with a larger head size, suggesting that it indeed is an important factor in the performance of attention models.

5 Conclusion

In this paper we studied the representation power of multi-head self attention models and proved the low-rank bottleneck that results from a small head size in multi-head attention. We showed that the larger embedding size used in current models is a consequence of this low-rank bottleneck in multi-head attention layers. We propose to instead use fixed head size attention units, with the head size set to the input sequence length, to avoid this bottleneck. We showed that this allows us to increase the number of heads without increasing the embedding size. As a consequence, we are able to train Transformers with a smaller embedding size and fewer parameters, with better performance. In the future, it will be interesting to experiment with varying head sizes within an attention block and across layers. This requires further understanding of the role of each layer in computing the context, which is an interesting direction for future work.

References

Appendix A Notation

$d$: Embedding size
$l$: Number of layers
$h$: Number of heads
$n$: Sequence length
$v$: Vocab size
$d_p$: Head size

Appendix B Proofs

Proof of Theorem 2.

First let us rewrite the MultiHead and FixedMultiHead layers as follows. The MultiHead layer can be rewritten as

$f(X; \Theta) = \sum_{i=1}^{h} W_o^i \cdot W_v^i X \cdot \text{Softmax}[(W_k^i X)^{\top} (W_q^i X)],$

where the $W_o^i$ are $d \times (d/h)$ matrices and $W_q^i$, $W_k^i$, and $W_v^i$ are $(d/h) \times d$ matrices. We denote the collection of all parameter matrices as $\Theta$.

Similarly, rewrite the fixed head size attention layer as

$g(X; \tilde{\Theta}) = \sum_{i=1}^{h} \tilde{W}_o^i \cdot \tilde{W}_v^i X \cdot \text{Softmax}[(\tilde{W}_k^i X)^{\top} (\tilde{W}_q^i X)],$

where $\tilde{W}_q^i, \tilde{W}_k^i, \tilde{W}_v^i \in \mathbb{R}^{d_p \times d}$ and $\tilde{W}_o^i \in \mathbb{R}^{d \times d_p}$. Let $\tilde{\Theta}$ be the collection of all these matrices.

The outline of the proof is basically a case analysis: we divide the possible values of $\Theta$ into three categories, and show in each case that there exists an $X$ such that $f(X; \Theta) \neq g(X; \tilde{\Theta})$. Here are the three cases:

  • Case 1: .

  • Case 2: , and there exists such that

    is not skew-symmetric.

  • Case 3: , and all are skew-symmetric.

Case 1.

In the first case, we can choose any such that . Choose . Then, note that for any column stochastic matrix , we have . Therefore,

Case 2.

In cases where , since is full rank by assumption and each is at most rank , it follows that all columns in must be linearly independent. Therefore, for any , is a set of linearly independent vectors, because each is a linear combination of column vectors of that are linearly independent of other column vectors in .

Now consider any , and , where . Define . Then, we have

Similarly, we can calculate

Notice that all the columns of and , from the second columns to the last ones, are the same. We now compare the first columns:

Recall that for any , are linearly independent, so if and only if all are zero. However, since there exists such that is not skew-symmetric, we can choose to be one that satisfies , hence making , therefore .

Case 3.

Now consider any , where and will be chosen later. Define , . Then, we have

Therefore, the first column of can be written as

Similarly, the first column of is

Since is skew-symmetric by assumption, we have for all . Recall that is rank- by assumption, so is at least rank , so we can choose any such that .

If both and are nonzero, we can always choose such that and . This means that if we choose and scale ,

Then, consider the difference . Recall that for any , is independent of . This means that, to show , it suffices to show that

If we scale with large enough , the second term will dominate the first term and the first term will never be able to cancel the second one. Thus, by choosing large enough , we can make sure that the sum is nonzero.

Even in case where one of and is zero (say ), we can choose and use a similar scaling argument. By choosing large enough and , one can show that the difference is nonzero. ∎

Appendix C Experimental settings

For our experiments with the language modeling (LM1B dataset), we train 6 layer Transformer models. We use a batch size of 4096 and train for 250k steps. We use a learning rate of 0.1 with a linear warm up for the first 10k steps. We decay the learning rate with the square root of the number of steps. We train the baseline models, with the prevalent head size heuristic, with the embedding dimension varying from 256 to 512. We fix the width of the feed forward layer in the Transformer to be 1024. In addition, we use weight decay of 0.01 and dropout with probability of 0.1 on all the layers.
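For reference, the warmup and decay schedule described above can be written as a small function. This is our reading of the description (linear warm up to the peak rate of 0.1 over the first 10k steps, then decay proportional to the inverse square root of the step count), not code taken from the tensor2tensor configuration:

```python
def lm1b_learning_rate(step, peak_lr=0.1, warmup_steps=10_000):
    """Learning rate at a given training step, as described in the text:
    linear warmup to `peak_lr` over `warmup_steps`, then square-root decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

for step in (1_000, 10_000, 40_000, 250_000):
    print(step, round(lm1b_learning_rate(step), 5))
# 1000 0.01, 10000 0.1, 40000 0.05, 250000 0.02
```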

For our experiments with BERT, we follow the same experimental settings as in (devlin2018bert). We present the key details here and refer the reader to (devlin2018bert) for the rest. We train with a batch size of 1024 for 450k steps with inputs of sequence length 128, followed by 50k steps with inputs of sequence length 512. In contrast, the BERT paper uses a batch size of 512, and does the pre-training for 900k steps with 128 sequence length inputs and 100k steps with 512 sequence length inputs. We train using ADAM with a learning rate of 1e-4, and a linear warmup and decay schedule as in BERT. We use 5k warmup steps for the first stage, and a re-warmup of 3k steps for the second stage (you2019reducing). Again, we use weight decay of 0.01 and dropout with probability of 0.1 on all the layers.

For the language modeling task, training is performed on 4 TPUv2 chips for a couple of hours. For BERT models training is performed on 16 TPUv3 chips in the first stage and 64 TPUv3 chips for the second stage. Pre-training with this configuration takes between 2 to 3 days. We did not attempt to find the optimal hyper-parameters for the fixed head size architecture, and use the same hyper-parameters as used for training the BERT models.

Appendix D Additional experimental results

Figure 4: Performance of Transformers trained with the prevalent head size heuristic (baseline) compared with the fixed head size ($d_p = 32$) models for a language modeling task (LM1B) on the test set. Unlike Fig. 1, we vary both the embedding size and the number of heads of the baseline models to keep their head size fixed at 32. We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, and vary the number of heads from 4 to 70, while matching the number of parameters. The plot again clearly indicates the advantage of the fixed head size models. The main issue with the baseline models is that fixing the head size to 32 forces the number of heads to be small when the embedding size is small. Reducing the number of heads below a certain threshold hurts the performance of the Transformer.
# heads       8            12           16           20
# params      214M         252M         290M         327M
SQuAD - F1    90.35±0.14   90.48±0.09   90.92±0.14   90.89±0.08
SQuAD - EM    83.37±0.12   83.67±0.03   84.16±0.35   84.29±0.16
MNLI          84.4±0.2     84.4±0.2     84.7±0.1     85.1±0.4
(A) Increasing number of heads
Table 3: (A): A 24 layer Transformer trained with a fixed head size of 128 and an embedding size of 768 shows an improvement in the accuracy with the increasing number of heads.