1 Introduction
Attention-based architectures such as Transformers have been effective for sequence modeling tasks such as machine translation (gehring2017convolutional; vaswani2017attention), question answering, sentence classification (radford2018improving; devlin2018bert), and document generation (liu2018generating). These models have emerged as better alternatives to recurrent models such as RNNs (sutskever2014sequence), LSTMs (hochreiter1997long), and GRUs (cho2014properties), mainly due to their feed-forward structure, which removes the sequential processing bottleneck for sequence data and makes them easier to train. Self-attention models have also found applications in vision (wang2018non), adversarial networks (zhang2018self), reinforcement learning (zambaldi2018relational; li2017deep), and speech recognition (chiu2018state).
Recent advances in applying self-attention models to natural language tasks first pretrain the models on a language modeling task and then fine-tune them on specific downstream tasks. radford2018improving and devlin2018bert used Transformers to pretrain a language model and showed that the fine-tuned model outperforms LSTMs on many natural language understanding and question answering tasks. For example, BERT (devlin2018bert), a 24-layer Transformer model, achieves state-of-the-art performance on several NLP tasks, including the SQuAD dataset. These advances, in addition to novel pretraining tasks, relied on bigger models with a larger embedding size. BERT uses an embedding size of 1024 (devlin2018bert); GPT-2 uses models with embedding sizes up to 1600 (radford2019language).
A single Transformer block consists of two key components: a multi-head self-attention layer followed by a feed-forward layer (vaswani2017attention). A single head in a multi-head attention layer computes self-attention between the tokens in the input sequence, which it then uses to compute a weighted average of embeddings for each token. Each head projects the data into a lower-dimensional subspace and computes the self-attention in this subspace. This projection size for each head is commonly referred to as the head size.
To keep the number of parameters in the attention layer fixed regardless of the number of heads, the prevalent heuristic is to set the head size to the embedding size divided by the number of heads. This heuristic was initially proposed in vaswani2017attention and has become the de facto standard in multi-head attention models (radford2018improving; devlin2018bert). However, increasing the number of heads decreases the head size, and with it the expressive power of individual heads. We prove that reducing the head size to a value below the input sequence length harms the representation power of each head (see Theorem 1), because a smaller head size introduces a rank constraint on the projection matrices in each head and limits their representation power. We indeed notice this effect in practice: while performance initially improves with an increasing number of heads (devlin2018bert), it drops once the number of heads exceeds a certain threshold, as seen in Table 1 and Fig. 1 (see also Table 4(A) in vaswani2017attention).
To avoid hurting performance, existing models support multiple heads by increasing the embedding size, which in turn increases the head size. However, a larger embedding size, in addition to increasing the number of parameters, makes it expensive to use the model and the learned embeddings in downstream tasks, as downstream model sizes scale with the embedding size of the tokens. For example, the inference time and memory required in retrieval tasks typically increase linearly with the embedding size.
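To make the arithmetic of this heuristic concrete, the following sketch (using the BERT-style values quoted in this paper, embedding size 1024 and pretraining sequence length 128) shows the head size dropping below the sequence length once the number of heads passes 8; the code is illustrative only:

```python
# Head size under the standard heuristic: head_size = embedding_size / num_heads.
# The values below (d = 1024, n = 128) follow the BERT settings quoted in the text.
d, n = 1024, 128

for h in [8, 16, 32]:
    head_size = d // h
    # Per Theorem 1, a head size below the sequence length n is rank-bottlenecked.
    bottlenecked = head_size < n
    print(h, head_size, bottlenecked)
```

With 8 heads the head size equals the pretraining sequence length; with 16 or 32 heads it falls below it, which matches the degradation reported in Table 1.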
Table 1: SQuAD and MNLI performance with an increasing number of heads under the standard head size heuristic; the total number of parameters (336M) stays fixed.

# heads | 8 | 16 | 32
# params | 336M | 336M | 336M
SQuAD F1 | 90.89 ± 0.15 | 90.61 ± 0.14 | 90.45 ± 0.08
SQuAD EM | 84.1 ± 0.34 | 83.75 ± 0.27 | 83.48 ± 0.13
MNLI | 85.0 ± 0.2 | 84.5 ± 0.4 | 84.4 ± 0.2
In this paper we propose setting the head size of attention units to the input sequence length. While this is a simple hyperparameter change in the Transformer architecture, we show that it is important to set this value appropriately to avoid the low-rank bottleneck (see Theorem 1) and to improve the representation power (see Theorem 2). This fixed head size is independent of both the number of heads and the embedding size of the model, which allows us to train models with a relatively smaller embedding size (hence fewer parameters) without affecting the head size. Another advantage of the fixed head size is that, unlike the standard setting, which requires the number of heads to be a factor of the embedding size, we are free to set an arbitrary number of heads as required for the task.
Interestingly, we note that this simple yet novel approach of fixing the head size in multi-head Transformers results in empirically superior performance. We evaluate Transformers trained with this fixed head size on language modeling (LM1B dataset), natural language inference (MNLI dataset), and question answering (SQuAD dataset). We show that fixing the head size allows us to train Transformers with better performance scaling and a smaller embedding size. In particular, Transformers with a fixed head size and an embedding size of 512 can match the performance of BERT-large (devlin2018bert), a Transformer with an embedding size of 1024 (see Fig. 2). We further present experimental results evaluating the effect of different choices of the head size and the embedding size in Section 4.
Our contributions in this paper lie in identifying and rigorously proving the low-rank bottleneck in multi-head attention models, and in showing that fixing the head size to the input sequence length results in a strictly better model, both theoretically and empirically. Our contributions are summarized below.

We analyze the representation power of the multi-head self-attention layer and prove the low-rank bottleneck that the head size places on the attention units (Theorem 1).

We propose to set the head size to the input sequence length, and show that this strictly improves the expressive power of multi-head attention layers compared to the standard heuristic for setting the head size (Theorem 2). This allows us to both increase the number of heads per layer and decrease the embedding size without hurting performance. We develop a novel construction-based approach to prove this result, which can potentially be useful in analyzing other variants of the Transformer architecture.

We experimentally show that with a fixed head size, Transformers can be trained with better performance scaling and a smaller embedding size on three standard NLP tasks.
1.1 Related Work
Given the significance of self-attention models, there has been work on both improving the performance and speeding up the computation in Transformers. ott2018scaling and you2019reducing reduce precision and use large-batch training to reduce the training time of attention models. child2019generating propose sparse self-attention models to speed up the computation in the attention layer for long-sequence data generation tasks, and show that these sparse attention models can be trained on tasks with sequence lengths greater than 10k without sacrificing accuracy. dehghani2018universal propose a depth-recurrent Transformer network that reuses parameters across layers, and show that this modification makes Transformer networks Turing complete even with finite-precision weights. yang2019xlnet propose a new way to increase the effective sequence length that the Transformer attends to, by reusing intermediate embeddings across sequences, and show that the modified architecture performs better on tasks that require computing context over longer sequences. We note that most of these modifications rely on the multi-head self-attention, the same building block of the Transformers. Our work studies this basic multi-head attention layer and suggests a new way to set the head size, which can easily be applied along with any of the above architectural modifications.
wu2019pay propose to replace the self-attention layer with lightweight dynamic convolutions and show improved performance on machine translation and language modeling. Even though the resulting model has faster inference time, it still needs a large embedding size (1024), as big as that of the original attention models. We believe the techniques in this paper can be combined with these results to realize both a smaller embedding size and faster inference time.
sun2019token perform neural architecture search using evolutionary methods on sequence-to-sequence models and find an evolved Transformer architecture, which, in addition to multi-head attention units, has convolution filters and gated linear units. Our proposed modifications stay closer to Transformers in spirit and can be used as seed units for this architecture search.
yang2017breaking have studied the effect of the rank constraint caused by small projection sizes in computing the softmax loss. The situation in self-attention layers is a bit different: while the expressive power of each head decreases with a decreasing head size, the number of heads increases at the same time, which could potentially negate this loss and increase the overall capacity of the layer. As we show in Theorem 2, the prevalent head size heuristic indeed limits the expressive power of the multi-head attention layer.
yun2019transformers studied the representation power of Transformers and showed that they are universal approximators of sequence-to-sequence functions. However, they do not study the low-rank bottleneck caused by the prevalent head size heuristic and its connection to the embedding size.
voita2019analyzing and michel2019sixteen study the importance of different heads in an attention layer. They observe that, during inference, many of the heads in each layer can be pruned away with little effect on the prediction. However, multiple heads are still needed during training.
child2019generating and correia2019adaptively impose a sparsity structure on the attention layer during training to improve both interpretability and performance. Fixing the head size will in fact make it easier to learn such sparsity patterns, as a low-rank constraint does not allow a head to express all possible sparsity patterns. Combining these techniques can hence potentially enable training sparse attention models with a smaller embedding size.
2 Transformer Architecture and Analysis
In this section, we present the Transformer architecture and analyze the representation power of the multi-head self-attention layer, a key component of the Transformer block.
The input to a Transformer network is a sequence of n tokens. Typically, each token is converted into a token embedding of dimension d by an embedding layer. We let X ∈ R^{d×n} be the embedding matrix corresponding to the n tokens in the input sequence.
2.1 SingleHead Attention
The Transformer block is a combination of a self-attention layer followed by a feed-forward layer (vaswani2017attention). Both layers have a skip connection and use Layer Normalization (LN) (ba2016layer). In particular, for token embeddings X, the dot-product attention is computed as follows.
Attention(X) = W_v X · Softmax[(W_k X)^T (W_q X) / √d_p]    (1)
Here W_q, W_k, and W_v represent the d_p × d projection matrices associated with the query, key, and value, respectively, in an attention unit (vaswani2017attention), and the Softmax operates separately on each column of its input. For a single-head attention unit, we have d_p = d. In the dot-product attention (cf. (1)), the matrix Softmax[(W_k X)^T (W_q X)/√d_p] aims to capture the context of the input for a given token based on the remaining tokens in the input sequence. Subsequently, the output of the attention layer takes the following form.
Z = LN(X + Attention(X))    (2)
where LN(·) represents the layer-normalization operation. Given the attention module as defined in (1), it is natural to question its ability to represent arbitrary contexts for a given input sequence X.
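For concreteness, here is a minimal NumPy sketch of the dot-product attention above, with a columnwise Softmax over an input X of shape d × n; the sizes and random inputs are arbitrary, and this is an illustration rather than the authors' implementation:

```python
import numpy as np

def softmax_cols(A):
    # Columnwise softmax: each column of A becomes a probability vector.
    A = A - A.max(axis=0, keepdims=True)  # subtract max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Dot-product attention: X is d x n, projections are d_p x d.
    d_p = Wq.shape[0]
    P = softmax_cols((Wk @ X).T @ (Wq @ X) / np.sqrt(d_p))  # n x n context
    return Wv @ X @ P  # d_p x n output

rng = np.random.default_rng(0)
d, n, d_p = 8, 4, 8
X = rng.standard_normal((d, n))
Wq, Wk, Wv = (rng.standard_normal((d_p, d)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (8, 4)
```

Each column of the context matrix P is a probability distribution over the input tokens, used to average the value projections.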
In the following result we establish that for a large enough projection size d_p an attention unit can represent any data pair (X, P). We also show that the model cannot represent arbitrary contexts when d_p is smaller than the sequence length n, creating a low-rank bottleneck.
Theorem 1 (Representation Theorem).
If d_p ≥ n, then given any full column rank matrix X ∈ R^{d×n} and an arbitrary positive column stochastic matrix P ∈ R^{n×n}, there always exist d_p × d projection matrices W_q and W_k such that

Softmax[(W_k X)^T (W_q X) / √d_p] = P.    (3)

If d_p < n, there exist X and P such that (3) does not hold for all W_q and W_k.
This result shows that the projection dimension d_p needs to be at least as large as the sequence length n for the attention unit to be able to represent any desired context P. Even though this result describes a single example sequence, it highlights a fundamental property of the model architecture: decreasing the projection size below a certain threshold introduces a bottleneck.
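The bottleneck is easy to observe numerically: the logit matrix (W_k X)^T (W_q X) factors through a d_p-dimensional projection, so its rank can never exceed d_p regardless of how W_q and W_k are chosen. A small NumPy check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 8

for d_p in [4, 8]:  # head sizes below and at the sequence length
    Wq = rng.standard_normal((d_p, d))
    Wk = rng.standard_normal((d_p, d))
    X = rng.standard_normal((d, n))
    logits = (Wk @ X).T @ (Wq @ X)  # n x n, but factors through R^{d_p}
    # Rank bound: rank(A^T B) <= d_p when A and B have d_p rows.
    assert np.linalg.matrix_rank(logits) <= d_p
    print(d_p, np.linalg.matrix_rank(logits))
```

When d_p < n, the n × n logit matrix is forced to be rank-deficient, which is exactly the constraint behind Theorem 1.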
Proof of Theorem 1.
Case d_p ≥ n. To prove the first part of the result, we present an explicit construction of W_q and W_k which allows us to generate P from X using the dot-product attention. Since X has full column rank, there exists a left inverse X† ∈ R^{n×d} such that X† X = I_n. Let W_k = [X†; 0] and W_q = √d_p · [P̃ X†; 0], where P̃ ∈ R^{n×n} is a matrix to be chosen later and the zero padding makes both matrices d_p × d. Then

(W_k X)^T (W_q X) / √d_p = P̃.    (4)

Now that the above choice of W_k and W_q has handled the dependence on X, we will choose a P̃ depending on P and finish the construction. Below we express the Softmax operation on the query and key inner products. Note that the Softmax here is a columnwise operator computing the attention scores for each query. By using (4), we obtain that

Softmax[(W_k X)^T (W_q X) / √d_p] = exp(P̃) · D_P̃^{-1},

where D_P̃ is an n × n diagonal matrix such that

(D_P̃)_{jj} = Σ_i exp(P̃_{ij}).

Hence, we can establish the desired result by showing that there always exists a P̃ that satisfies the following fixed point equation:

exp(P̃) · D_P̃^{-1} = P.    (5)

Given P, to construct such a P̃, we pick an arbitrary positive diagonal matrix D ∈ R^{n×n} and set

P̃ = ln(P · D).    (6)

Since P is a positive matrix, such a P̃ always exists. Next, we verify that this construction indeed satisfies the fixed point equation (cf. (5)). Note that

(D_P̃)_{jj} = Σ_i exp(P̃_{ij}) = Σ_i P_{ij} D_{jj} = D_{jj},    (7)

where the last equality follows from the fact that P is a column stochastic matrix. Now, using (6) and (7), exp(P̃) · D_P̃^{-1} = P · D · D^{-1} = P.

This completes the first part of the proof.
Case d_p < n. Consider the case of d_p = 1 and n = 2. Then W_q and W_k are 1 × d matrices, so W_q X and W_k X are 1 × 2 row vectors and (W_k X)^T (W_q X) is a rank-one 2 × 2 matrix. Let X = [x, x] for some nonzero x ∈ R^d, and write s = W_q x and t = W_k x. Then

(W_k X)^T (W_q X) = [ts, ts; ts, ts],

whose columnwise Softmax is the uniform matrix with all entries 1/2 for every choice of W_q and W_k. This matrix clearly cannot be used to generate context matrices P that have distinct elements in the second column, e.g., P = [1/2, 1/4; 1/2, 3/4]. ∎
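The constructive step of the first part of the proof is also easy to check numerically: for any positive column stochastic P and any positive diagonal D, the columnwise Softmax of ln(PD), the choice made in (6), recovers P exactly. A sanity check with a random P (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# Arbitrary positive column stochastic matrix P (columns sum to 1).
P = rng.random((n, n)) + 0.1
P = P / P.sum(axis=0, keepdims=True)

D = np.diag(rng.random(n) + 0.5)   # arbitrary positive diagonal matrix
P_tilde = np.log(P @ D)            # the construction in (6)

# The columnwise softmax of P_tilde gives back P, verifying the fixed point (5):
E = np.exp(P_tilde)                # elementwise, equals P @ D
recovered = E / E.sum(axis=0, keepdims=True)
assert np.allclose(recovered, P)
```

The column sums of exp(P̃) equal the diagonal entries of D because P's columns sum to one, so the normalization cancels D exactly.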
2.2 MultiHead Attention
As discussed in Section 2.1, an attention unit updates the embedding of an input token based on a weighted average of the embeddings of all the tokens in the sequence, using the context (cf. (1)). vaswani2017attention proposed the MultiHead attention mechanism, which increases the representation power of an attention layer by letting multiple attention units operate on different low-dimensional projections of the input, with each attention unit referred to as a head. This is followed by concatenation of the outputs from the different heads. In particular, the computation inside a MultiHead attention layer with h heads takes the following form:

head_i(X) = W_v^i X · Softmax[(W_k^i X)^T (W_q^i X) / √(d/h)],  i = 1, …, h.
The output of the MultiHead attention layer then becomes

MultiHead(X) = W_o · [head_1(X); …; head_h(X)],    (8)

where W_o ∈ R^{d×d} and [·; ·] denotes concatenation along the rows. For a model with h heads, the query, key, and value projection matrices W_q^i, W_k^i, and W_v^i are (d/h) × d matrices. Therefore, each head projects the input onto a (d/h)-dimensional subspace to compute the context, which keeps the number of parameters per layer fixed. Using MultiHead has resulted in empirically better performance over the single-head attention layer (vaswani2017attention).
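A minimal NumPy sketch of this multi-head computation, with each of the h heads projecting to d/h dimensions and the outputs concatenated and projected as above (sizes illustrative, not the authors' implementation):

```python
import numpy as np

def softmax_cols(A):
    # Columnwise softmax: each column becomes a probability vector.
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def multihead(X, heads, Wo):
    # heads: list of (Wq, Wk, Wv) triples, each of shape (d/h) x d.
    outputs = []
    for Wq, Wk, Wv in heads:
        P = softmax_cols((Wk @ X).T @ (Wq @ X) / np.sqrt(Wq.shape[0]))
        outputs.append(Wv @ X @ P)       # (d/h) x n per head
    return Wo @ np.concatenate(outputs)  # row-concat -> d x n, then project

rng = np.random.default_rng(3)
d, n, h = 16, 4, 4
heads = [tuple(rng.standard_normal((d // h, d)) for _ in range(3)) for _ in range(h)]
Wo = rng.standard_normal((d, d))
out = multihead(rng.standard_normal((d, n)), heads, Wo)
print(out.shape)  # (16, 4)
```

Note that each head's projections shrink as h grows, since their row count is d/h; this is the coupling the next subsection examines.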
2.3 Low-Rank Bottleneck
While increasing the number of heads seemingly gives the model more expressive power, at the same time we are reducing the head size, which can decrease the expressive power. When the number of heads h is larger than d/n, the attention unit inside each head projects onto a dimension d/h smaller than the sequence length n, creating a low-rank bottleneck, and loses its ability to represent arbitrary context vectors (cf. Theorem 1). Interestingly, this is consistent with the empirical observation in Table 1 that increasing h beyond 8 results in performance degradation for BERT (devlin2018bert); note that n = 128 for most of the pretraining phase of BERT, so with d = 1024 the head size d/h drops below n exactly when h exceeds 8. Since the sequence length is fixed by the data/task at hand, the only way to increase the number of heads without introducing the low-rank bottleneck is to increase the embedding size d. This is a fundamental limitation of the currently dominant head size heuristic: we need to increase the embedding size in order to support more heads.
Unfortunately, increasing the embedding size leads to higher computation and memory requirements for training and storing the model. Further, since it is common to use learned embeddings from Transformer-based models in downstream tasks (devlin2018bert), a larger embedding size increases the model size and computation required for all the downstream tasks as well.
3 Fixed MultiHead Attention
In this section we propose to fix the head size of the Transformer, which allows us to enjoy the advantage of the higher expressive power of multiple heads without requiring a large embedding size. The key is to decouple the dependency between the projection size in a head and the embedding size of the model. The projection matrices now project onto subspaces of a fixed dimension d_p, irrespective of the number of heads h. This approach, in which d_p is independent of both d and h, leads to the following attention mechanism:

head_i(X) = W_v^i X · Softmax[(W_k^i X)^T (W_q^i X) / √d_p],  i = 1, …, h.
Note that the projection matrices used here, W_q^i, W_k^i, and W_v^i, are d_p × d matrices. With W_o ∈ R^{d×h·d_p}, the output of this new multi-head attention layer takes the following form.
This modification makes each attention head more similar to a hidden unit in a feed-forward network or a filter in a convolutional network, and allows us to vary the number of heads without worrying about reducing the representation power per head. The downside is that, unlike the standard MultiHead, the number of parameters per layer increases with the number of heads. However, this modification allows us to train a model with a smaller embedding size without a low-rank bottleneck, ultimately allowing us to reduce the total number of parameters in the model.
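Relative to the standard multi-head computation, the only change is in the projection shapes: each head now uses d_p × d projections with d_p fixed independently of h, and the output projection maps h·d_p back to d. A sketch with illustrative sizes (note that h need not divide d, and the per-layer parameter count now grows with h):

```python
import numpy as np

def softmax_cols(A):
    # Columnwise softmax: each column becomes a probability vector.
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def fixed_multihead(X, heads, Wo):
    # heads: list of (Wq, Wk, Wv) triples, each of shape d_p x d,
    # with d_p FIXED (independent of the number of heads).
    outputs = []
    for Wq, Wk, Wv in heads:
        P = softmax_cols((Wk @ X).T @ (Wq @ X) / np.sqrt(Wq.shape[0]))
        outputs.append(Wv @ X @ P)       # d_p x n per head
    return Wo @ np.concatenate(outputs)  # Wo: d x (h * d_p) -> output d x n

rng = np.random.default_rng(4)
d, n, h = 16, 8, 6   # h need not divide d
d_p = n              # head size set to the sequence length, as proposed
heads = [tuple(rng.standard_normal((d_p, d)) for _ in range(3)) for _ in range(h)]
Wo = rng.standard_normal((d, h * d_p))
out = fixed_multihead(rng.standard_normal((d, n)), heads, Wo)
print(out.shape)  # (16, 8)
```

Unlike the d/h heuristic, the per-layer parameter count 3·h·d_p·d + d·h·d_p grows with h, which is the tradeoff discussed above.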
3.1 MultiHead vs. FixedMultiHead Attention
Given a MultiHead layer, we can always represent it using a FixedMultiHead layer whenever the fixed head size satisfies d_p ≥ d/h. While this shows that the individual heads of the FixedMultiHead are at least as expressive as the ones in the MultiHead, it is not obvious whether the FixedMultiHead layer is strictly more expressive: can it represent functions that the standard MultiHead layer cannot? In this subsection we show that, indeed, in the multi-head regime the FixedMultiHead layer is strictly better than the standard MultiHead layer in terms of expressive power.
Consider the standard multi-head attention units in (8), and denote the collection of all their parameter matrices by Θ. Similarly, consider the function represented by the fixed head size attention units, and let Θ̃ be the collection of all its parameter matrices. We define F_MH and F_FMH to be the classes of functions realized by the MultiHead and FixedMultiHead layers, respectively. As noted above, if d_p ≥ d/h, we have F_MH ⊆ F_FMH.
The following theorem shows that, even for simple examples in F_FMH, functions in F_MH fail to represent them; this already shows that F_MH is a strict subset of F_FMH.
Theorem 2.
Let the head size of the MultiHead layer satisfy d/h < n. Consider a FixedMultiHead attention layer with head size n whose parameters Θ̃ satisfy mild non-degeneracy conditions (in particular, full-rank value projections; see the proof in Appendix B for the precise conditions).
Then, for any MultiHead parameters Θ, there exists an input X on which the outputs of the two layers differ.
Because the difference between the two layers' outputs is a continuous function of the input X, the existence of such an X implies that the integral of the norm of the difference (i.e., the approximation error) is strictly positive. We note that the assumptions on Θ̃ in the above theorem are made to provide a simple and constructive proof; in fact, the failure of MultiHead layers with head size d/h < n to represent such simple attention layers suggests that the situation is likely worse for more complex functions.
Theorem 2 shows that the expressive power of the FixedMultiHead attention function class is strictly superior to that of the standard MultiHead attention function class. Hence the heuristic of reducing the head size with the number of heads limits the expressive power of the MultiHead layer, whereas using a fixed head size increases the expressive power of the attention layers.
4 Experiments
The goal of this section is to show that setting the head size in a principled way leads to better performance than using the prevalent heuristic. We again note that, while this is a simple hyperparameter change to the Transformer, setting it to the input sequence length, as suggested by our analysis, allows us to train better models with a smaller embedding size.
In this section we present our experiments on three standard NLP tasks, language modeling (LM1B), question answering (SQuAD), and sentence entailment (MNLI), to demonstrate that: 1) increasing the number of heads in Transformers beyond a certain point hurts performance with the prevalent head size heuristic, but always helps with fixed head size attention layers; 2) decoupling the head size from the embedding size allows us to train models with a smaller embedding size; and 3) setting the head size appropriately allows us to train models with better performance scaling. We first describe our experimental setup, followed by our results and ablation studies on the proposed modifications.
4.1 Setup and Datasets
For the language modeling task we use the One Billion Word benchmark (LM1B) (lm1b). This dataset has around 30M training examples and around 300k test examples. We use a subword tokenizer with a 32k vocabulary and cap the input at a sequence length of 256. We train a 6-layer Transformer model with the ADAM optimizer using the tensor2tensor library (tensor2tensor). The detailed experimental settings are presented in Appendix C.
Multi-Genre Natural Language Inference (MNLI) is a sentence-level entailment task designed to test natural language understanding (MNLI). Given a premise sentence and a hypothesis sentence, the goal is to predict whether the hypothesis entails, contradicts, or is neutral to the premise. We report classification accuracy for this task. The Stanford Question Answering Dataset (SQuAD) is a question answering dataset in which, given a paragraph and a question, the goal is to predict the span of words in the paragraph that constitutes the answer (rajpurkar2016squad). This is a harder, word-level task compared to the sentence classification task. We report both Exact Match (EM) and F1 scores for this task. All results in this section are reported on the Dev set, which has not been used in any experimental choices in this paper.
For these latter two tasks, we follow the two-stage approach of first pretraining on a language modeling task and then fine-tuning the models on the task data. We follow the same experimental setup for both pretraining and fine-tuning as BERT (devlin2018bert), and use their codebase (https://github.com/google-research/bert). We first pretrain our models using the masked language model and next sentence prediction objectives, and then fine-tune the pretrained model for individual tasks (devlin2018bert). For pretraining we use the English Wikipedia and BooksCorpus datasets (zhu2015aligning). The input to the models is tokenized using the WordPiece representation with a 30,000-token vocabulary. We present the key experimental choices in Appendix C and refer the reader to devlin2018bert for a complete description of the setup.
Table 2: Performance of the fixed head size Transformers on SQuAD and MNLI with (A) an increasing number of heads and (B) an increasing head size.

# heads | 8 | 12 | 16 | 32
# params | 168M | 193M | 218M | 319M
SQuAD F1 | 89.6 ± 0.17 | 90.25 ± 0.21 | 90.43 ± 0.14 | 90.95 ± 0.14
SQuAD EM | 82.73 ± 0.21 | 83.18 ± 0.24 | 83.59 ± 0.06 | 84.4 ± 0.29
MNLI | 83.5 ± 0.2 | 84.2 ± 0.2 | 83.9 ± 0.2 | 84.9 ± 0.2
(A) Increasing number of heads

head size | 32 | 64 | 128 | 256
# params | 130M | 142M | 168M | 218M
SQuAD F1 | 88.53 ± 0.06 | 89.51 ± 0.15 | 89.6 ± 0.17 | 90.33 ± 0.23
SQuAD EM | 81.19 ± 0.21 | 82.41 ± 0.32 | 82.73 ± 0.21 | 83.36 ± 0.48
MNLI | 82.5 ± 0.1 | 83.4 ± 0.3 | 83.5 ± 0.2 | 83.9 ± 0.2
(B) Increasing head size
Choice of the head size. Our proposed modification introduces the head size as a new model hyperparameter. We choose a head size of 128 for our BERT experiments, as most of the pretraining is done with sequence length 128 data. While our ablation studies (cf. Table 2(B)) show that a bigger head size improves performance, there is a tradeoff between increasing the head size, the number of heads, and the number of layers. We found that having a sufficiently large head size, e.g., matching the pretraining sequence length, is better than having a larger embedding size.
4.2 Results
For our first set of experiments we want to see whether Transformers trained with a fixed head size and a smaller embedding size can match the performance of training with the standard head size heuristic and a larger embedding size. As a baseline for the language modeling task, we train Transformers with embedding sizes increasing from 256 to 512 and different numbers of heads. We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, with the number of heads increasing from 4 to 70. We notice that Transformers with a fixed head size and an embedding size of 256 perform better than the baseline models with an embedding size of 512 (see Fig. 1). We repeat a similar experiment on the other two tasks, where for the baseline we train BERT-large, a 24-layer, 16-head Transformer with the standard head size heuristic, with embedding sizes from 512 to 1024. We compare it with the fixed head size model, with an embedding size of 512 and a head size of 128, with the number of heads increasing from 8 to 32. We again notice that the Transformers trained with a fixed head size and a 512 embedding size perform better than the baseline (see Fig. 2).
Note that simply trying to increase the head size of the Transformer by decreasing the number of heads does not improve performance, as decreasing the number of heads reduces the expressive power of the model (see Fig. 4 in the Appendix). Hence, both the head size and the number of heads need to be set high enough for good performance.
4.3 Ablation
Increasing heads. From Table 1 and Fig. 1(a) we can see that increasing the number of heads hurts the performance of the Transformer after a certain point. We repeat the same experiments with the fixed head size Transformer and present the results in Table 2(A) and Fig. 2(a). The results show that the performance of the modified model improves monotonically as the number of heads increases. This is because the model capacity (a function of the head size) no longer shrinks with an increasing number of heads.
Increasing head size. In Table 2(B) and Fig. 2(b), we present comparisons between models with different head sizes. This shows that the performance gains of the fixed head size models indeed come from adjusting the head size of the query, key, and value layers in the attention unit. The table shows a clear trend of better performance with a larger head size, suggesting that it is indeed an important factor in the performance of attention models.
5 Conclusion
In this paper we studied the representation power of multi-head self-attention models and proved the low-rank bottleneck that results from a small head size in multi-head attention. We showed that the larger embedding size used in current models is a consequence of this low-rank bottleneck in multi-head attention layers. We propose to instead use fixed head size attention units, with the head size set to the input sequence length, to avoid this bottleneck. We showed that this allows us to increase the number of heads without increasing the embedding size. As a consequence, we are able to train Transformers with a smaller embedding size and fewer parameters, with better performance. In the future, it will be interesting to experiment with varying head sizes within an attention block and across layers. This requires further understanding of the role of each layer in computing the context, which is an interesting direction for future work.
Appendix A Notation
Embedding size: d
Number of layers: l
Number of heads: h
Sequence length: n
Vocabulary size: v
Head size: d_p
Appendix B Proofs
Proof of Theorem 2.
First let us rewrite the MultiHead and FixedMultiHead layers as follows. The MultiHead layer can be rewritten as a sum over heads, where the output blocks W_o^i are d × (d/h) matrices and W_q^i, W_k^i, and W_v^i are (d/h) × d matrices. We denote the collection of all parameter matrices as Θ.
Similarly, rewrite the fixed head size attention layer as a sum over heads, where the output blocks W̃_o^i are d × d_p matrices and W̃_q^i, W̃_k^i, and W̃_v^i are d_p × d matrices. Let Θ̃ be the collection of all these matrices.
The outline of the proof is a case analysis: we divide the possible values of Θ into three categories, and show in each case that there exists an input X on which the MultiHead and FixedMultiHead outputs differ. Here are the three cases:

Case 1: the effective value maps of the two layers differ, i.e., Σ_i W_o^i W_v^i ≠ Σ_i W̃_o^i W̃_v^i.

Case 2: the effective value maps agree, and there exists a head i such that (W_k^i)^T W_q^i is not skew-symmetric.

Case 3: the effective value maps agree, and all the matrices (W_k^i)^T W_q^i are skew-symmetric.
Case 1.
In the first case, choose a vector x such that (Σ_i W_o^i W_v^i) x ≠ (Σ_i W̃_o^i W̃_v^i) x, and set X = x 1^T, so that every column of X equals x. Then, for any column stochastic matrix P, we have X P = X. Therefore, the two layers reduce to (Σ_i W_o^i W_v^i) X and (Σ_i W̃_o^i W̃_v^i) X, and their outputs differ.
Case 2.
In cases where , since is full rank by assumption and each is at most rank , it follows that all columns in must be linearly independent. Therefore, for any ,
is a set of linearly independent vectors, because each
is a linear combination of column vectors of that are linearly independent of other column vectors in . Now consider any , and , where . Define . Then, we have
Similarly, we can calculate
Notice that all the columns of and , from the second columns to the last ones, are the same. We now compare the first columns:
Recall that for any , are linearly independent, so if and only if all are zero. However, since there exists such that is not skew-symmetric, we can choose to be one that satisfies , hence making , and therefore .
Case 3.
Now consider any , where and will be chosen later. Define , . Then, we have
Therefore, the first column of can be written as
Similarly, the first column of is
Since is skew-symmetric by assumption, we have for all . Recall that is rank by assumption, so is at least rank , and we can choose any such that .
If both and are nonzero, we can always choose such that and . This means that if we choose and scale ,
Then, consider the difference . Recall that for any , is independent of . This means that, to show , it suffices to show that
If we scale with large enough , the second term will dominate the first term and the first term will never be able to cancel the second one. Thus, by choosing large enough , we can make sure that the sum is nonzero.
Even in the case where one of and is zero (say ), we can choose and use a similar scaling argument. By choosing large enough and , one can show that the difference is nonzero. ∎
Appendix C Experimental settings
For our experiments with the language modeling task (LM1B dataset), we train 6-layer Transformer models. We use a batch size of 4096 and train for 250k steps. We use a learning rate of 0.1 with a linear warm-up for the first 10k steps, and decay the learning rate with the inverse square root of the number of steps. We train the baseline models, with the prevalent head size heuristic, with the embedding dimension varying from 256 to 512. We fix the width of the feed-forward layer in the Transformer to 1024. In addition, we use a weight decay of 0.01 and dropout with probability 0.1 on all layers.
For our experiments with BERT, we follow the same experimental settings as [devlin2018bert]. We present the key details here and refer the reader to [devlin2018bert] for the rest. We train with a batch size of 1024 for 450k steps with inputs of sequence length 128, followed by 50k steps with inputs of sequence length 512. In contrast, the BERT paper uses a batch size of 512 and does the pretraining for 900k steps with sequence length 128 inputs and 100k steps with sequence length 512 inputs. We train using ADAM with a learning rate of 1e-4 and a linear warmup and decay schedule as in BERT. We use 5k warmup steps for the first stage, and a re-warmup of 3k steps for the second stage [you2019reducing]. Again, we use a weight decay of 0.01 and dropout with probability 0.1 on all layers.
For the language modeling task, training is performed on 4 TPUv2 chips for a couple of hours. For the BERT models, training is performed on 16 TPUv3 chips in the first stage and 64 TPUv3 chips in the second stage. Pretraining with this configuration takes between 2 and 3 days. We did not attempt to find the optimal hyperparameters for the fixed head size architecture, and use the same hyperparameters as used for training the BERT models.
Appendix D Additional experimental results
# heads | 8 | 12 | 16 | 20
# params | 214M | 252M | 290M | 327M
SQuAD F1 | 90.35 ± 0.14 | 90.48 ± 0.09 | 90.92 ± 0.14 | 90.89 ± 0.08
SQuAD EM | 83.37 ± 0.12 | 83.67 ± 0.03 | 84.16 ± 0.35 | 84.29 ± 0.16
MNLI | 84.4 ± 0.2 | 84.4 ± 0.2 | 84.7 ± 0.1 | 85.1 ± 0.4
(A) Increasing number of heads