Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

06/05/2020 ∙ by Krzysztof Choromanski, et al. ∙ Google ∙ University of Cambridge

Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding exploration of these models. To address this challenge, we present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. Furthermore, it provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence. It is also backwards-compatible with pre-trained regular Transformers. We demonstrate its effectiveness on the challenging task of protein sequence modeling and provide detailed theoretical analysis.




1 Introduction and related work

Transformers Vaswani et al. (2017); Dehghani et al. (2019) are powerful neural network architectures that have become SOTA in several areas of machine learning, including Natural Language Processing (NLP) (e.g. speech recognition Luo et al. (2020)), Neural Machine Translation (NMT) Chen et al. (2018), document generation/summarization, time series prediction, generative modeling (e.g. image generation Parmar et al. (2018)), music generation Huang et al. (2019), and analysis of biological sequences Rives et al. (2019); Madani et al. (2020); Li (2019). Transformers rely on a trainable attention mechanism that specifies complex dependencies between the elements of each input sequence (e.g. amino acids within a protein). Unfortunately, a standard Transformer scales quadratically with the number of tokens $L$ in the input sequence, which is prohibitively expensive for large $L$. Several solutions have been proposed to address this issue Beltagy et al. (2020); Gulati et al. (2020); Chan et al. (2020); Child et al. (2019). Most approaches restrict the attention mechanism to attend to local neighborhoods Parmar et al. (2018), or incorporate structural priors on attention such as sparsity Child et al. (2019), pooling-based compression Rae et al. (2020), clustering/binning/convolution techniques (e.g. Roy et al. (2020), which applies $k$-means clustering to learn dynamic sparse attention regions, or Kitaev et al. (2020), where locality-sensitive hashing is used to group together tokens with similar embeddings), sliding windows Beltagy et al. (2020), or truncated targeting Chelba et al. (2020). These approaches thus do not aim to approximate regular attention, but rather propose simpler and more tractable attention mechanisms, often by incorporating additional constraints (e.g. identical query and key sets, as in Kitaev et al. (2020)), or by trading regular attention for sparse attention using more layers Child et al. (2019). Furthermore, many of these works require special customized GPU operations (e.g. writing C++ CUDA kernels Child et al. (2019) or using TVMs Beltagy et al. (2020)). Other techniques that aim to improve the time complexity of Transformers include reversible residual layers, allowing for one-time activation storage in training Kitaev et al. (2020), and shared attention weights Xiao et al. (2019). These constraints may impede application to problems that involve long sequences, where approximations of the attention mechanism are not sufficient. Approximations based on truncated back-propagation Dai et al. (2019) are also unable to capture long-distance correlations, since gradients are only propagated inside a localized window.

Recent work has demonstrated that Transformers fit to the amino acid sequences of single proteins learn to accurately predict information about protein structure and function, and can generate new sequences with specific properties Rives et al. (2019); Elnaggar et al. (2019); Madani et al. (2020). Approaches that encode 3D protein structural data via Transformer-based models demonstrate improved performance, despite the restriction of attention to the local structural neighborhoods of each node Du et al. (2020); Ingraham et al. (2019). These models provide initial promise for protein design applications, but their applicability beyond the design of single proteins is limited because they truncate sequences to 512 or 1024 amino acids. The ability to scale to longer sequences without imposing sparsity constraints would enable the use of Transformers to jointly model multiple concatenated protein sequences and the interactions between them. This follows recent works employing simpler statistical models that predict protein quaternary structure, protein-protein interactions and protein interaction networks from evolutionary sequence data Weigt et al. (2009); Hopf et al. (2012); Ovchinnikov et al. (2014); Bitbol et al. (2016); Cong et al. (2019).

In response, we present a new Transformer architecture, the Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our proposed mechanism has several properties required by modern protein modeling: it scales linearly rather than quadratically in the number of tokens $L$ in the sequence (important for analysis involving compounds of protein molecules), is characterized by sub-quadratic space complexity, and does not incorporate any sparsity pattern priors. Furthermore, it provides strong theoretical guarantees: unbiased estimation of the regular attention matrix and uniform convergence. FAVOR is designed for long input sequences where the number of tokens $L$ satisfies $L \gg d$, for embedding dimensionality $d$. In contrast to previous approaches, instead of simplifying regular attention via various structural priors (which can lead to different, potentially incompatible architectures), we show that it can be effectively approximated as it is, without any "liftings". This makes our method flexible: combined with small amounts of fine-tuning, the Performer is backwards-compatible with pretrained regular Transformers, and can also be used beyond the Transformer scope as a more scalable replacement for regular attention, which itself has a wide variety of uses in computer vision Fu et al. (2019), reinforcement learning Zambaldi et al. (2019), and even combinatorial optimization Vinyals et al. (2015). We demonstrate its effectiveness on the challenging task of protein modeling.

We show that regular attention can be considered a special case of a much larger class of kernel-driven attention mechanisms, Generalized Attention (GA), and that all our results for regular attention translate directly to this extended class. This observation enables us to explore a much larger class of attention models (Sec. 2.2). Interestingly, this exploration is often enabled by the FAVOR mechanism even if linear scaling is not required (Sec. 4). We highlight the following contributions:

  • We present Fast Attention Via Orthogonal Random features (FAVOR) (Sec. 2), which can be used as a replacement for regular attention. FAVOR is characterized by $O(ML + Md + Ld)$ space complexity and $O(LMd)$ time complexity (for $M$ random features), as compared to $O(L^2 + Ld)$ space complexity and $O(L^2 d)$ time complexity for the regular algorithm (Sec. 2.6, Sec. 3).

  • We present a general class of kernel-based attention mechanisms, Generalized Attention (GA), which can be handled by FAVOR; standard attention is a special case (Sec. 2.2).

  • We provide strong theoretical guarantees regarding FAVOR: unbiasedness of our estimator of the attention matrix (Sec. 2.3) and uniform convergence (Sec. 3).

  • We empirically evaluate FAVOR via the Performer for protein modeling, demonstrating in practice all the aforementioned advantages (Sec. 4).

  • We show that our mechanism, implemented in Jax Frostig et al. (2018), is API-compatible with the regular Transformer, whose standard dot-product attention can be replaced by FAVOR with all other components of the architecture intact.

All proofs are given in full in the Appendix.

2 Generalized Attention via FAVOR mechanism

Below we describe in detail our FAVOR mechanism which is the backbone of our architecture. We also present a general class of kernel-based attentions, called Generalized Attention (GA) (which includes regular attention as a special case), where FAVOR can be applied.

2.1 Preliminaries - standard attention mechanism

Let $L$ be the size of an input sequence of tokens. Then regular dot-product attention Vaswani et al. (2017) is a mapping which accepts matrices $Q, K, V \in \mathbb{R}^{L \times d}$ as input, where $d$ is the hidden dimension (dimension of the latent representation). The matrices $Q, K, V$ are intermediate representations of the input, and their rows can be interpreted as queries, keys and values of the continuous dictionary data structure, respectively. Bidirectional (or non-directional Devlin et al. (2018)) dot-product attention has the following form:

$$\mathrm{Att}_{\leftrightarrow}(Q, K, V) = D^{-1} A V, \qquad A = \exp(QK^{\top}/\sqrt{d}), \qquad D = \mathrm{diag}(A \mathbf{1}_L), \tag{1}$$

where $\exp(\cdot)$ is applied elementwise, $\mathbf{1}_L$ is the all-ones vector of length $L$, and $\mathrm{diag}(\cdot)$ is a diagonal matrix with the input vector as the diagonal. The runtime complexity of computing (1) is $O(L^2 d)$, because the attention matrix $A \in \mathbb{R}^{L \times L}$ has to be computed and stored explicitly. Hence, in principle, dot-product attention of type (1) is incompatible with end-to-end processing of long sequences.
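As a minimal NumPy sketch (illustrative only; the paper's implementation is in Jax), the bidirectional mechanism of Eq. (1) can be written by materializing the full $L \times L$ matrix $A$:

```python
import numpy as np

def attention_bidirectional(Q, K, V):
    """Regular dot-product attention D^{-1} A V with A = exp(Q K^T / sqrt(d)).

    Explicitly materializes the L x L matrix A: O(L^2 d) time, O(L^2) space.
    """
    d = Q.shape[-1]
    A = np.exp(Q @ K.T / np.sqrt(d))             # L x L attention matrix
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)   # inverse of diag(A 1_L)
    return D_inv * (A @ V)

rng = np.random.default_rng(0)
L, d = 6, 4
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = attention_bidirectional(Q, K, V)
```

The rows of $D^{-1}A$ sum to one, so each output row is a convex combination of the rows of $V$, as described in Sec. 2.2.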

Another important type of attention is unidirectional dot-product attention, which has the form:

$$\mathrm{Att}_{\to}(Q, K, V) = \tilde{D}^{-1} \tilde{A} V, \qquad \tilde{A} = \mathrm{tril}(A), \qquad \tilde{D} = \mathrm{diag}(\tilde{A} \mathbf{1}_L), \tag{2}$$

where $\mathrm{tril}(\cdot)$ returns the lower-triangular part of the argument matrix, including the diagonal. As discussed in Vaswani et al. (2017), unidirectional attention is used for autoregressive generative modelling with Transformers, where the output sequence is modelled as $p(x_1, \ldots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_1, \ldots, x_{i-1})$. Therefore the probability distribution of the $i$-th token can only depend on the embeddings of tokens preceding it. Unidirectional attention is used as self-attention in generative Transformers as well as in the decoder part of Seq2Seq Transformers Vaswani et al. (2017), while bidirectional attention is used in encoder self-attention and encoder-decoder attention in Seq2Seq architectures.
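The causal masking via $\mathrm{tril}$ can be sketched in NumPy as follows (an illustration, not the paper's Jax code):

```python
import numpy as np

def attention_unidirectional(Q, K, V):
    """Causal dot-product attention: position i attends only to positions j <= i."""
    d = Q.shape[-1]
    A = np.tril(np.exp(Q @ K.T / np.sqrt(d)))    # keep lower triangle incl. diagonal
    return (A @ V) / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
L, d = 5, 3
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = attention_unidirectional(Q, K, V)
# The first position can only attend to itself, so its output row equals the first row of V.
```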

A line of work relies on sparse approximation of the matrix $A$, either through restricting the sparsity pattern of $A$ Child et al. (2019) or learning it using Locality-Sensitive Hashing (LSH) techniques Kitaev et al. (2020). The latter results in $O(L \log L)$ runtime complexity. We will show that, without any structural assumptions, the matrix $A$ can be approximated up to any precision in time linear in $L$.

2.2 Generalized Attention (GA)

The idea of the attention mechanism is simple. New representations of tokens are obtained from previous ones by taking convex combinations of value vectors, with the coefficients of the convex combinations interpreted as renormalized (i.e. summing to one) similarity measures between tokens. High similarities imply strong attendance to the corresponding tokens. These similarity measures are simple ad-hoc "softmax-style" functions of a dot-product between the query $q_i$ of token $i$ and the key $k_j$ of token $j$, namely:

$$\mathrm{sim}(i, j) = \exp\!\big(q_i^{\top} k_j / \sqrt{d}\big), \tag{3}$$

where $q_i^{\top}$ and $k_j^{\top}$ denote the $i$-th row of $Q$ and the $j$-th row of $K$ respectively. Note that $\mathrm{sim}$ is not a commutative operation here, and the $\sqrt{d}$-renormalizer is a technical modification to stabilize the range of $q_i^{\top} k_j$ and avoid very small/large values.

However, what if we use kernels as similarity measures instead of the ad-hoc ones above? Specifically, $q_i$ and $k_j$ can be entangled through a valid kernel function, by defining the attention matrix as:

$$A_{i,j} = g(q_i)\, K(q_i, k_j)\, h(k_j), \tag{4}$$

where $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is an arbitrary kernel function and $g, h : \mathbb{R}^d \to \mathbb{R}$. We call the attention mechanism defined above Generalized Attention (GA), parameterized by $g$, $h$ and $K$.

Next we show that not only can FAVOR approximate regular attention governed by Eq. 3, but it can also be applied to GAs, as long as the corresponding kernels can be effectively estimated via a random feature map mechanism Rahimi and Recht (2007), which is the case for most kernels used in practice. We will in fact show that regular attention is a special case of GA for a specific choice of $g$ and $h$, and the Gaussian kernel $K_{\mathrm{gauss}}$.
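A small NumPy sketch (illustrative; the function names are ours) makes the claim concrete: with $g(x) = h(x) = \exp(\|x\|^2/2)$ and the Gaussian kernel, GA on queries/keys pre-scaled by $d^{-1/4}$ reproduces regular softmax attention exactly.

```python
import numpy as np

def generalized_attention(Q, K, V, kernel, g, h):
    """Generalized Attention: A_ij = g(q_i) * kernel(q_i, k_j) * h(k_j), row-renormalized."""
    L = Q.shape[0]
    A = np.array([[g(Q[i]) * kernel(Q[i], K[j]) * h(K[j]) for j in range(L)]
                  for i in range(L)])
    return (A @ V) / A.sum(axis=1, keepdims=True)

gauss = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))   # Gaussian kernel, sigma = 1
amp = lambda x: np.exp(0.5 * np.sum(x ** 2))               # g = h = exp(||x||^2 / 2)

rng = np.random.default_rng(6)
L, d = 5, 4
Q = rng.normal(size=(L, d)) / d ** 0.25    # rows pre-scaled by d^{-1/4}
K = rng.normal(size=(L, d)) / d ** 0.25
V = rng.normal(size=(L, d))

out = generalized_attention(Q, K, V, gauss, amp, amp)
# Regular attention on the same pre-scaled inputs: A_reg = exp(q_i^T k_j)
A_reg = np.exp(Q @ K.T)
expected = (A_reg @ V) / A_reg.sum(axis=1, keepdims=True)
```

The identity used is $\exp(\|q\|^2/2) \exp(-\|q-k\|^2/2) \exp(\|k\|^2/2) = \exp(q^{\top}k)$.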

2.3 Towards FAVOR: approximating attention with random features (RFs)

Instead of computing and storing the attention matrix $A$ explicitly, we derive its unbiased stochastic approximation, which benefits from low-rank structure. We take our inspiration from a randomized scheme to train kernel Support Vector Machines with large training data Rahimi and Recht (2007).

Let $q_i^{\top}$ and $k_j^{\top}$ denote the $i$-th rows of the matrices $Q$ and $K$ respectively, rescaled by $d^{-1/4}$. For regular attention, the $(i,j)$-th element of $A$ can be expressed as:

$$A_{i,j} = \exp(q_i^{\top} k_j) = \exp\!\big(\|q_i\|^2/2\big) \cdot \exp\!\big(-\|q_i - k_j\|^2/2\big) \cdot \exp\!\big(\|k_j\|^2/2\big). \tag{5}$$

In other words, the attention matrix $A$ can be decomposed as:

$$A = D_Q B D_K \tag{6}$$

for $D_Q = \mathrm{diag}\big(\exp(\|q_i\|^2/2)\big)$ and $D_K = \mathrm{diag}\big(\exp(\|k_j\|^2/2)\big)$. Both $D_Q$ and $D_K$ can be computed in $O(Ld)$ time. Note that the $(i,j)$-th element of the matrix $B$ is the value of the Gaussian kernel with $\sigma = 1$:

$$B_{i,j} = K_{\mathrm{gauss}}(q_i, k_j) = \exp\!\big(-\|q_i - k_j\|^2/2\big). \tag{7}$$

For GA, our analysis is similar. This time $D_Q$ and $D_K$ have nonzero entries of the form $g(q_i)$ and $h(k_j)$ (for regular attention we have $g(x) = h(x) = \exp(\|x\|^2/2)$), and the Gaussian kernel is replaced by a general kernel $K$, namely $B_{i,j} = K(q_i, k_j)$, as in Equation 4.
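The decomposition can be verified numerically; the short NumPy check below (our illustration, with rows already rescaled by $d^{-1/4}$) confirms $A = D_Q B D_K$ to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 5, 4
Q = rng.normal(size=(L, d)) / d ** 0.25   # renormalized queries q_i / d^{1/4}
K = rng.normal(size=(L, d)) / d ** 0.25   # renormalized keys    k_j / d^{1/4}

A = np.exp(Q @ K.T)                       # attention logits exp(q_i^T k_j / sqrt(d))
D_Q = np.diag(np.exp(np.sum(Q ** 2, axis=1) / 2))
D_K = np.diag(np.exp(np.sum(K ** 2, axis=1) / 2))
# B is the Gaussian-kernel Gram matrix K_gauss(q_i, k_j) = exp(-||q_i - k_j||^2 / 2)
B = np.exp(-0.5 * np.sum((Q[:, None, :] - K[None, :, :]) ** 2, axis=-1))
decomposed = D_Q @ B @ D_K                # should equal A exactly
```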

In the remainder of this section we derive an unbiased stochastic approximation of the matrix $A$ based on a low-rank decomposition of $B$ with the use of random feature maps Rahimi and Recht (2007).

For a given kernel $K$, the random feature (RF) map $\Phi$ corresponding to $K$ is a probabilistic embedding satisfying:

$$K(x, y) = \mathbb{E}\big[\Phi(x)^{\top} \Phi(y)\big], \tag{8}$$

where the expectation is with respect to the randomness of $\Phi$, and $M$ denotes the number of random features (if Eq. 8 holds only approximately, we refer to the mechanism as an approximate random feature map). Efficient-to-compute random feature maps exist for virtually all classes of kernels used in machine learning, e.g. shift-invariant kernels Rahimi and Recht (2007), the pointwise nonlinear Gaussian kernel related to neural networks Gulrajani et al. (2017), and more, though the techniques used to derive these random mappings vary from class to class Choromanski et al. (2017). Even more interestingly, for most of these kernels the corresponding random feature maps have a similar structure, namely:

$$\Phi(x) = \frac{c}{\sqrt{M}}\big(f_1(w_1^{\top} x), \ldots, f_1(w_M^{\top} x), \ldots, f_l(w_1^{\top} x), \ldots, f_l(w_M^{\top} x)\big)^{\top} \tag{9}$$

for some functions $f_1, \ldots, f_l : \mathbb{R} \to \mathbb{R}$, a distribution $\mathcal{D}$ over $\mathbb{R}^d$, and a constant $c > 0$. Here $W \in \mathbb{R}^{M \times d}$ has rows $w_1^{\top}, \ldots, w_M^{\top}$ drawn i.i.d. from $\mathcal{D}$.

In particular, for the Gaussian kernel we have $l = 2$, $c = 1$, $f_1 = \sin$, $f_2 = \cos$ and:

$$\Phi(x) = \frac{1}{\sqrt{M}}\big(\sin(w_1^{\top} x), \ldots, \sin(w_M^{\top} x), \cos(w_1^{\top} x), \ldots, \cos(w_M^{\top} x)\big)^{\top}, \tag{10}$$

where $w_1, \ldots, w_M \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I_d)$. This particular form of $\Phi$ is a consequence of the celebrated Bochner's Theorem Rahimi and Recht (2007). We now define $\hat{Q}$ and $\hat{K}$ as:

$$\hat{Q} = D_Q \Phi(Q), \qquad \hat{K} = D_K \Phi(K), \tag{11}$$

where $\Phi(Q)$ and $\Phi(K)$ apply the map $\Phi$ to every (rescaled) row of $Q$ and $K$ respectively. Note that $\hat{q}_i = \exp(\|q_i\|^2/2)\,\Phi(q_i)$ and $\hat{k}_j = \exp(\|k_j\|^2/2)\,\Phi(k_j)$, where $\hat{q}_i^{\top}$ and $\hat{k}_j^{\top}$ stand for the $i$-th rows of $\hat{Q}$ and $\hat{K}$ respectively. Then, according to Equation 8, we have $\mathbb{E}[\hat{q}_i^{\top} \hat{k}_j] = D_Q(i,i)\, K_{\mathrm{gauss}}(q_i, k_j)\, D_K(j,j) = A_{i,j}$, and thus:

$$A = \mathbb{E}\big[\hat{Q}\hat{K}^{\top}\big]. \tag{12}$$

We conclude that the attention matrix $A$ can be approximated without bias as $\hat{A} = \hat{Q}\hat{K}^{\top}$. We will leverage this unbiased, low-rank (if $M \ll L$) approximate decomposition of $A$ in our algorithm, even though we will never explicitly compute $\hat{A}$.
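The sin/cos random feature map of Eq. 10 can be sketched directly in NumPy (our illustration; with a large $M$ the Monte Carlo estimate of the Gaussian Gram matrix $B$ is close to exact):

```python
import numpy as np

def gaussian_rf(X, W):
    """Random feature map for the Gaussian kernel: Phi(x) = M^{-1/2} [sin(Wx); cos(Wx)]."""
    M = W.shape[0]
    proj = X @ W.T                                          # L x M projections w_i^T x
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1) / np.sqrt(M)

rng = np.random.default_rng(2)
L, d, M = 4, 3, 20000
X, Y = rng.normal(size=(L, d)), rng.normal(size=(L, d))
W = rng.normal(size=(M, d))                                 # w_i ~ N(0, I_d) i.i.d.

B_exact = np.exp(-0.5 * np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
B_hat = gaussian_rf(X, W) @ gaussian_rf(Y, W).T             # unbiased estimate of B
```

The estimator reduces to $\frac{1}{M}\sum_i \cos\big(w_i^{\top}(x - y)\big)$, whose expectation is exactly $\exp(-\|x - y\|^2/2)$ for $w_i \sim \mathcal{N}(0, I_d)$.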

Note that one can also define a valid kernel as $K(x, y) = \mathbb{E}[\Phi(x)^{\top}\Phi(y)]$ for $\Phi$ as in Eq. 9 and arbitrary functions $f_1, \ldots, f_l$. Such kernels cover in particular the family of Pointwise Nonlinear Gaussian Kernels Choromanski et al. (2017) (intrinsically related to nonlinear neural networks), such as arc-cosine kernels (e.g. angular kernels). Most of these kernels do not have closed forms, so computing exact GAs for them would not be possible, but computation is of course feasible with the presented mechanism.

2.4 Towards FAVOR: refinements via orthogonal random features

For isotropic $\mathcal{D}$ (true for most practical applications, including regular attention), instead of sampling $w_1, \ldots, w_M$ independently, we can use orthogonal random features (ORFs) Yu et al. (2016); Choromanski et al. (2017, 2018b): these maintain (exactly or approximately) the marginal distributions of the samples while enforcing that different samples are orthogonal. If we need $M > d$, ORFs can still be used locally within each $d \times d$ block of $W$ Yu et al. (2016).

ORFs were introduced to reduce the variance of Monte Carlo estimators Yu et al. (2016); Choromanski et al. (2017, 2018b, 2019a); Rowland et al. (2019); Choromanski et al. (2018a, 2019b), and we show in Secs. 3 and 4 that they do indeed lead to more accurate approximations and substantially better downstream results. Below we briefly review the most efficient ORF mechanisms (based on their strengths and costs) that we will use in Sec. 2.6 in the analysis of FAVOR.

(1) Regular ORFs [R-ORFs]: Applies Gaussian orthogonal matrices Yu et al. (2016). Encodes the matrix $W$ in $O(Md)$ space. Provides an algorithm for computing $Wx$ in $O(Md)$ time for any $x \in \mathbb{R}^d$. Gives unbiased estimation. Requires one-time $O(Md^2)$ preprocessing (Gram-Schmidt orthogonalization).

(2) Hadamard/Givens ORFs [H/G-ORFs]: Applies random Hadamard Choromanski et al. (2017) / Givens matrices Choromanski et al. (2019b). Encodes the matrix $W$ in $O(M)$ / $O(M \log d)$ space. Provides an algorithm for computing $Wx$ in $O(M \log d)$ time for any $x \in \mathbb{R}^d$. Gives small bias (going to $0$ with $d \to \infty$).
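A minimal sketch of the R-ORF construction (our illustration): orthogonalize a Gaussian block via QR (a Gram-Schmidt equivalent), then rescale each row by an independent chi-distributed norm so the rows keep the $\mathcal{N}(0, I_d)$ marginals while staying exactly orthogonal within each block.

```python
import numpy as np

def gaussian_orthogonal_matrix(M, d, rng):
    """Sample an M x d matrix with N(0, I_d)-marginal rows that are orthogonal
    within each d x d block (the R-ORF construction)."""
    blocks, remaining = [], M
    while remaining > 0:
        G = rng.normal(size=(d, d))
        Q_mat, _ = np.linalg.qr(G)                  # square orthogonal matrix
        # Rescale rows by chi-distributed norms to restore Gaussian marginals.
        norms = np.linalg.norm(rng.normal(size=(d, d)), axis=1)
        blocks.append(Q_mat * norms[:, None])
        remaining -= d
    return np.concatenate(blocks, axis=0)[:M]

rng = np.random.default_rng(3)
W = gaussian_orthogonal_matrix(4, 6, rng)           # M <= d: one block, all rows orthogonal
```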

2.5 FAVOR: Fast Attention via Orthogonal Random features

We are ready to present the full FAVOR algorithm. In the bidirectional case, the approximate attention computed by FAVOR is given as:

$$\widehat{\mathrm{Att}}_{\leftrightarrow}(Q, K, V) = \hat{D}^{-1}\big(\hat{Q}\big(\hat{K}^{\top} V\big)\big), \qquad \hat{D} = \mathrm{diag}\big(\hat{Q}\big(\hat{K}^{\top} \mathbf{1}_L\big)\big), \tag{13}$$

where $\hat{Q}$ and $\hat{K}$ are given by Eq. 11. The placement of brackets determines the order in which computations are conducted. Note that we never explicitly compute $\hat{A} = \hat{Q}\hat{K}^{\top}$ and consequently avoid $\Theta(L^2)$ time complexity and storing the approximate attention matrix (see Sec. 2.6 for rigorous analysis).
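The bracketed order of computations can be sketched in NumPy as below (our illustration; we use positive feature matrices here so the normalizer stays well-behaved, which sin/cos features do not guarantee in general):

```python
import numpy as np

def favor_bidirectional(Q_hat, K_hat, V):
    """Approximate attention D^{-1}(Q_hat (K_hat^T V)) without forming any L x L matrix.

    Q_hat, K_hat: L x M feature matrices. Cost: O(LMd) time, O(LM + Md + Ld) space.
    """
    numer = Q_hat @ (K_hat.T @ V)                             # (L x M)(M x d)
    denom = Q_hat @ (K_hat.T @ np.ones(K_hat.shape[0]))       # equals A_hat 1_L
    return numer / denom[:, None]

rng = np.random.default_rng(4)
L, M, d = 8, 5, 3
Q_hat = np.abs(rng.normal(size=(L, M)))   # positive features: stable normalizer
K_hat = np.abs(rng.normal(size=(L, M)))
V = rng.normal(size=(L, d))

# Agrees with the explicit quadratic-cost computation:
A_hat = Q_hat @ K_hat.T
expected = (A_hat @ V) / A_hat.sum(axis=1, keepdims=True)
result = favor_bidirectional(Q_hat, K_hat, V)
```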

Input: $Q, K, V \in \mathbb{R}^{L \times d}$; isBidirectional - binary flag.
Result: $\widehat{\mathrm{Att}}_{\leftrightarrow}(Q, K, V)$ if isBidirectional, $\widehat{\mathrm{Att}}_{\to}(Q, K, V)$ otherwise.
Compute $\hat{Q}$ and $\hat{K}$ according to (11), as explained in Sec. 2.3;
if isBidirectional then
  compute the result according to (13);
else
  compute the tensor $G$ and its prefix-sum tensor $G^{\mathrm{PS}}$ according to (14), and assemble the result from $G^{\mathrm{PS}}$ and $\hat{Q}$;
end if
return the result;
Algorithm 1 FAVOR (bidirectional or unidirectional).

2.5.1 Prefix-sums for unidirectional FAVOR

For the unidirectional case our analysis is similar, but this time the goal is to compute $\tilde{\hat{A}} V$ for $\tilde{\hat{A}} = \mathrm{tril}(\hat{Q}\hat{K}^{\top})$ without constructing and storing the $L \times L$-sized matrix $\tilde{\hat{A}}$ explicitly. In order to do so, observe that for the 3d-tensors $G$ and $G^{\mathrm{PS}}$ defined by:

$$G_j = \hat{k}_j v_j^{\top}, \qquad G^{\mathrm{PS}}_i = \sum_{j \le i} G_j, \tag{14}$$

we have $[\tilde{\hat{A}} V]_i = \hat{q}_i^{\top} G^{\mathrm{PS}}_i$. Each slice $G^{\mathrm{PS}}_i$ is therefore the result of a prefix-sum (or cumulative-sum) operation applied to $G$. An efficient algorithm to compute the prefix-sum of $L$ elements takes $O(L)$ total steps and $O(\log L)$ time when computed in parallel Ladner and Fischer (1980); Cormen et al. (2009). See Algorithm 1 for the whole approach.
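The prefix-sum trick can be sketched with a cumulative sum in NumPy (our illustration, again with positive features for a stable normalizer; a parallel scan would replace `np.cumsum` in a production setting):

```python
import numpy as np

def favor_unidirectional(Q_hat, K_hat, V):
    """Causal FAVOR: [tril(Q_hat K_hat^T) V]_i = q_hat_i^T (sum_{j<=i} k_hat_j v_j^T)."""
    G = K_hat[:, :, None] * V[:, None, :]        # L x M x d outer products k_hat_j v_j^T
    G_ps = np.cumsum(G, axis=0)                  # prefix sums over sequence positions
    numer = np.einsum('lm,lmd->ld', Q_hat, G_ps)
    denom = np.einsum('lm,lm->l', Q_hat, np.cumsum(K_hat, axis=0))
    return numer / denom[:, None]

rng = np.random.default_rng(5)
L, M, d = 7, 4, 2
Q_hat = np.abs(rng.normal(size=(L, M)))
K_hat = np.abs(rng.normal(size=(L, M)))
V = rng.normal(size=(L, d))

# Agrees with explicitly materializing tril(Q_hat K_hat^T):
A_tri = np.tril(Q_hat @ K_hat.T)
expected = (A_tri @ V) / A_tri.sum(axis=1, keepdims=True)
result = favor_unidirectional(Q_hat, K_hat, V)
```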

2.6 Time and space complexity analysis

We see that a variant of bidirectional FAVOR using regular RFs (based on i.i.d. samples) or R-ORFs has $O(ML + Md + Ld)$ space complexity, as opposed to the $O(L^2 + Ld)$ space complexity of the baseline. Unidirectional FAVOR using fast parallel prefix-sum precomputation Ladner and Fischer (1980); Cormen et al. (2009) has $O(MLd)$ space complexity to store $G^{\mathrm{PS}}$, which can be reduced to $O(ML + Md + Ld)$ by running a simple (though non-parallel in $L$) aggregation of $G^{\mathrm{PS}}$ without storing the whole tensor in memory. From Sec. 2.4, we know that if instead we use G-ORFs, then the space needed to encode $W$ is reduced to $O(M \log d)$, and with the H-ORF mechanism it is further reduced to $O(M)$. Thus for $M, d \ll L$, all our variants provide substantial space complexity improvements, since they do not need to store the attention matrix explicitly.

The time complexity of Algorithm 1 is $O(LMd)$ (note that constructing $\hat{Q}$ and $\hat{K}$ can be done in $O(LMd)$ time via Eq. 11, provided that samples from $\mathcal{D}$ can be obtained in $O(d)$ time and the functions $f_i$ can be evaluated in constant time, which is the case for all practical applications). Note that the time complexity of our method is much lower than the $\Theta(L^2 d)$ complexity of the baseline for $M \ll L$.

As explained in Sec. 2.4, the R-ORF mechanism incurs an extra one-time $O(Md^2)$ cost (negligible compared to the $O(LMd)$ term for $d \ll L$). H-ORFs and G-ORFs do not have this cost, and when FAVOR uses them, computing $\hat{Q}$ and $\hat{K}$ can be conducted in $O(ML \log d)$ time, as opposed to $O(MLd)$ (see Sec. 2.4). Thus even though H/G-ORFs do not change the asymptotic time complexity, they improve the constant factor of the leading term. This plays an important role in training very large models.

The number of random features $M$ allows a trade-off between computational complexity and the level of approximation: bigger $M$ results in higher computation costs, but also in a lower variance of the estimate of $A$. In the next section we will show that in practice we can take $M = \Theta(d \log d)$.

Observe that the algorithm obtained is highly-parallelizable, and benefits from fast matrix multiplication and broadcasted operations on GPUs or TPUs.

3 Theoretical convergence analysis

In contrast to other methods approximating the attention matrix $A$, our algorithm provides provable, strong uniform convergence guarantees for compact domains. We show that $M$, the optimal number of random features, does not depend on $L$ but only on $d$. In fact, we prove that if we take $M = \Theta(d \log d)$, then we can approximate $A$ up to any precision in $O(Ld^2 \log d)$ time, regardless of the number of tokens $L$. In order to provide these guarantees for FAVOR, we leverage recent research on the theory of negative dependence for ORFs Lin et al. (2020). The following is true:

Theorem 1 (Uniform convergence of FAVOR).

Take the generalized attention mechanism defined by $g, h$ (see Sec. 2.2) and a radial basis function (RBF) kernel $K$ Choromanski et al. (2018b) with corresponding spectral distribution $\Omega$ (e.g. the Gaussian kernel, for which $\Omega = \mathcal{N}(0, I_d)$). Assume that the rows of the matrices $Q$ and $K$ are taken from a ball of radius $R$ centered at $0$ (i.e. the norms of queries and keys are upper-bounded by $R$). Define $l = R d^{-1/4}$ and take $g^* = \max_{\|x\| \le l} |g(x)|$ and $h^* = \max_{\|x\| \le l} |h(x)|$. Then for any $\epsilon > 0$, $\delta = \frac{\epsilon}{g^* h^*}$, and number of random features $M = \Omega\big(\frac{d}{\delta^2} \log\big(\frac{\sigma l}{\delta}\big)\big)$ (for a constant $\sigma$ depending on the spectral distribution $\Omega$), the following holds: $\|\hat{A} - A\|_{\infty} \le \epsilon$ with any constant probability, where $\hat{A}$ approximates the generalized attention matrix $A$ via FAVOR with R-ORFs.

The result holds in particular for regular attention with the Gaussian kernel (see Sec. 2.2), for which $g^* = h^* = e^{l^2/2}$, since $g(x) = h(x) = \exp(\|x\|^2/2)$.

4 Experiments

We implement our setup on top of pre-existing Transformer training code in Jax Frostig et al. (2018), and complement our theory with empirical evidence demonstrating FAVOR's practicality in the protein setting. Unless explicitly stated otherwise, a Performer replaces only the attention component with FAVOR, while all other components are exactly the same as in a regular Transformer. Furthermore, since we use the cross-entropy loss in our generative training experiments, we use the standard accuracy metric as defined in supervised learning.

4.1 Computation costs

We compared the speed of the backward pass, as it is one of the main computational bottlenecks during training, for a Transformer and a Performer in two settings: when the architecture is mostly composed of attention while the other dimensions are kept small, and at the regular default size, where $d_{ff}$ denotes the width of the MLP layers of the Transformer. We observed (Fig. 1) that in terms of $L$, the Performer reaches nearly linear time complexity, as opposed to the Transformer's quadratic time complexity. Furthermore, the Performer's memory consumption is sub-quadratic (as it does not store the explicit $L \times L$-sized attention matrix), which allows both higher batch sizes and longer sequence lengths. The combination of memory and backward-pass efficiencies for large $L$ has profound implications for training speed, as it allows, respectively, large-batch training and lower wall-clock time per gradient step, contributing to total train-time reduction. We present additional results, including the forward pass, in Appendix A, varying the number of layers and architecture sizes as well.

Figure 1: Comparison of Transformer and Performer in terms of backward-pass speed and maximum allowed $L$. Plots shown up to the point when a model produces an out-of-memory error on a V100 GPU with 16GB. Best in color.

4.2 Approximation error and compatibility with regular Transformer

We further examined the approximation error of the attention matrix implicitly defined in FAVOR in Fig. 2 (and in Fig. 8 in Appendix B), which directly affects the accuracy of FAVOR's output. We demonstrate that orthogonal features generally produce lower error than unstructured features.

Figure 2: Approximation errors for both the attention matrix and the output of the mechanism itself, as we varied the number of random features $M$. Standard deviations shown across 10 samples.

Figure 3: We transferred the original pretrained Transformer's weights into the Performer, which produces an initial non-zero accuracy of 0.07 (dotted orange line). Once fine-tuned, however, the Performer quickly recovers accuracy in less than 1/6th of the original number of gradient steps.

Notice that the accuracy can be further boosted by applying a resampling strategy that periodically redraws the random features. We set this period as a hyperparameter of our overall algorithm.

The approximation error can propagate when applying the other components (MLPs, multiple heads, multiple layers, etc.) of a Transformer, as we demonstrate in Fig. 7 (Appendix). This implies that we cannot immediately transfer the weights from a pretrained Transformer onto the Performer. However, this can be resolved by fine-tuning the Performer on the training task. We demonstrate this technique for a pretrained BERT model Devlin et al. (2018) on the LM1B dataset Chelba et al. (2014) in Fig. 3.

4.3 Generalized Attention

We investigated Generalized Attention mechanisms (Sec. 2.2) on protein datasets Consortium (2019) with sequences of up to length 512, for various kernel functions. Using hyperparameter sweeps across multiple variables in FAVOR, we compared several kernels with renormalization turned on or off (Fig. 4; renormalization corresponds to applying the $D^{-1}$ operator in attention, as in the standard mechanism, though we noticed that disabling it does not necessarily hurt accuracy) to produce the best training configuration for the Performer. We found that the sigmoid kernel with renormalization on was the optimal configuration for the Performer.

Figure 4: The black dashed line corresponds to the baseline using regular attention. To emphasize the highest-accuracy runs, we set the y-axis to log scale. We tested four kernels defined by four different functions (see Sec. 2.2): sigmoid, exponential, identity and cosine.

4.4 Training on concatenated protein sequences

Finally, we demonstrate that the Performer can model multiple concatenated protein sequences, as required to model and predict interactions among groups of proteins from sequence data. For this proof-of-principle study, we use protein sequences from the Jan. 2019 release of TrEMBL Consortium (2019), concatenated to a length long enough to model protein interaction networks without the large sequence alignments required by existing methods Cong et al. (2019). We train models on a Cloud TPU v3, containing 16GB RAM per chip. At this length, a baseline Transformer overloads memory by a wide margin, even at the smallest batch sizes per chip. Thus as a baseline we were forced to use a significantly smaller architecture variant. Meanwhile, the Performer trains efficiently at a batch size of 16 per chip using the standard architecture. We see in Fig. 5 that the Transformer's accuracy quickly plateaus, while the Performer is able to train continuously, increasing its performance as training progresses.

Figure 5: Generative training for Transformer (Small) and Performer.

5 Conclusion

We presented the Performer, a new type of Transformer, relying on our Fast Attention Via Orthogonal Random features (FAVOR) mechanism to significantly improve the space and time complexity of regular Transformers. Our mechanism is, to our knowledge, the first unbiased estimation of the original algorithm with linear space and time complexity with respect to $L$. Further, FAVOR could be applied to other tasks requiring approximate attention, including hierarchical attention networks (HANs) Yang et al. (2016), graph attention networks Velickovic et al. (2018), image processing Fu et al. (2019), and reinforcement learning/robotics Tang et al. (2020).

6 Broader impact

We believe that the presented algorithm can be impactful in various ways:

Biology and Medicine: Our method has the potential to directly impact research on biological sequence analysis by enabling the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix. The initial application that we consider is the prediction of interactions between proteins on the proteome scale. Recently published approaches require large evolutionary sequence alignments, a bottleneck for applications to mammalian genomes Cong et al. (2019). The potentially broad translational impact of applying these approaches to biological sequences was one of the main motivations of this work. We believe that modern bioinformatics can immensely benefit from new machine learning techniques with Transformers being among the most promising. Scaling up these methods to train faster more accurate language models opens the door to the ability to design sets of molecules with pre-specified interaction properties. These approaches could be used to augment existing physics-based design strategies that are of critical importance for example in the development of new nanoparticle vaccines Marcandalli et al. (2019).

Environment: As we have shown, Performers with FAVOR are characterized by much lower compute costs and substantially lower space complexity, which can be directly translated into emission reductions Strubell et al. (2019) and lower energy consumption You et al. (2020), as regular Transformers require very large computational resources.

Research on Transformers: We believe that our results can shape research on efficient Transformer architectures, guiding the field towards methods with strong mathematical foundations. Our research may also extend Transformers beyond their standard scope (e.g. by considering the Generalized Attention mechanism and connections with kernels). Exploring scalable Transformer architectures that can handle $L$ on the order of a few thousand and more, while preserving the accuracy of the baseline, is a gateway to new breakthroughs in bioinformatics, e.g. language modeling for proteins, as we explained in the paper. Our presented method is potentially a first step.

Backward Compatibility: Our Performer can be used on top of a regular pretrained Transformer, as opposed to other Transformer variants. Even if up-training is not required, FAVOR can still be used for fast inference with no loss of accuracy. We view this backward compatibility as a very important additional feature of the presented techniques, one that might be particularly attractive to practitioners.

Attention Beyond Transformers: Finally, FAVOR can be applied to approximate exact attention outside the scope of Transformers as well. This opens a large volume of potential new applications, including hierarchical attention networks (HANs) Yang et al. (2016), graph attention networks Velickovic et al. (2018), image processing Fu et al. (2019), and reinforcement learning/robotics Tang et al. (2020).


  • [1] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. CoRR abs/2004.05150. External Links: Link, 2004.05150 Cited by: §1.
  • [2] A. Bitbol, R. S. Dwyer, L. J. Colwell, and N. S. Wingreen (2016) Inferring interaction partners from protein sequences. Proceedings of the National Academy of Sciences 113 (43), pp. 12180–12185. Cited by: §1.
  • [3] W. Chan, C. Saharia, G. E. Hinton, M. Norouzi, and N. Jaitly (2020) Imputer: sequence modelling via imputation and dynamic programming. CoRR abs/2002.08926. External Links: Link, 2002.08926 Cited by: §1.
  • [4] C. Chelba, M. X. Chen, A. Bapna, and N. Shazeer (2020)

    Faster transformer decoding: n-gram masked self-attention

    CoRR abs/2001.04589. External Links: Link, 2001.04589 Cited by: §1.
  • [5] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson (2014) One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pp. 2635–2639. Cited by: §4.2.
  • [6] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. F. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes (2018) The best of both worlds: combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 76–86. External Links: Link, Document Cited by: §1.
  • [7] R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. CoRR abs/1904.10509. External Links: Link, 1904.10509 Cited by: §1, §2.1.
  • [8] K. Choromanski, C. Downey, and B. Boots (2018) Initialization matters: orthogonal predictive state recurrent neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §2.4.
  • [9] K. M. Choromanski, M. Rowland, and A. Weller (2017) The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 219–228. External Links: Link Cited by: §2.3, §2.3, §2.4, §2.4, §2.4.
  • [10] K. Choromanski, A. Pacchiano, J. Pennington, and Y. Tang (2019) KAMA-NNs: low-dimensional rotation based neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 236–245. External Links: Link Cited by: §2.4.
  • [11] K. Choromanski, M. Rowland, W. Chen, and A. Weller (2019) Unifying orthogonal Monte Carlo methods. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1203–1212. External Links: Link Cited by: §2.4, §2.4.
  • [12] K. Choromanski, M. Rowland, T. Sarlós, V. Sindhwani, R. E. Turner, and A. Weller (2018) The geometry of random features. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, A. J. Storkey and F. Pérez-Cruz (Eds.), Proceedings of Machine Learning Research, Vol. 84, pp. 1–9. External Links: Link Cited by: §2.4, §2.4, Theorem 1.
  • [13] Q. Cong, I. Anishchenko, S. Ovchinnikov, and D. Baker (2019) Protein interaction networks revealed by proteome coevolution. Science 365 (6449), pp. 185–189. Cited by: §1, §4.4, §6.
  • [14] U. Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47 (D1), pp. D506–D515. Cited by: §4.3, §4.4.
  • [15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009) Introduction to algorithms, 3rd edition. MIT Press. External Links: Link, ISBN 978-0-262-03384-8 Cited by: §2.5.1, §2.6.
  • [16] Z. Dai*, Z. Yang*, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-XL: language modeling with longer-term dependency. External Links: Link Cited by: §1.
  • [17] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §1.
  • [18] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §2.1, §4.2.
  • [19] Y. Du, J. Meier, J. Ma, R. Fergus, and A. Rives (2020) Energy-based models for atomic-resolution protein conformations. arXiv preprint arXiv:2004.13167. Cited by: §1.
  • [20] A. Elnaggar, M. Heinzinger, C. Dallago, and B. Rost (2019) End-to-end multitask learning, from protein language to protein features without alignments. bioRxiv, pp. 864405. Cited by: §1.
  • [21] R. Frostig, M. Johnson, and C. Leary (2018) Compiling machine learning programs via high-level tracing. External Links: Link Cited by: 5th item, §4.
  • [22] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 3146–3154. Cited by: §1, §5, §6.
  • [23] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang (2020) Conformer: convolution-augmented transformer for speech recognition. External Links: 2005.08100 Cited by: §1.
  • [24] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5767–5777. External Links: Link Cited by: §2.3.
  • [25] T. A. Hopf, L. J. Colwell, R. Sheridan, B. Rost, C. Sander, and D. S. Marks (2012) Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149 (7), pp. 1607–1621. Cited by: §1.
  • [26] C. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck (2019) Music transformer: generating music with long-term structure. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §1.
  • [27] J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola (2019) Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, pp. 15794–15805. Cited by: §1.
  • [28] N. Kitaev, L. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §2.1, §A.
  • [29] R. E. Ladner and M. J. Fischer (1980-10) Parallel prefix computation. J. ACM 27 (4), pp. 831–838. External Links: ISSN 0004-5411, Link, Document Cited by: §2.5.1, §2.6.
  • [30] J. Li (2019) Universal transforming geometric network. CoRR abs/1908.00723. External Links: Link, 1908.00723 Cited by: §1.
  • [31] H. Lin, H. Chen, T. Zhang, C. Laroche, and K. Choromanski (2020) Demystifying orthogonal Monte Carlo and beyond. CoRR abs/2005.13590. Cited by: §3, §C.
  • [32] H. Luo, S. Zhang, M. Lei, and L. Xie (2020) Simplified self-attention for transformer-based end-to-end speech recognition. CoRR abs/2005.10463. External Links: Link, 2005.10463 Cited by: §1.
  • [33] A. Madani, B. McCann, N. Naik, N. S. Keskar, N. Anand, R. R. Eguchi, P. Huang, and R. Socher (2020) ProGen: language modeling for protein generation. CoRR abs/2004.03497. External Links: Link, 2004.03497 Cited by: §1, §1.
  • [34] J. Marcandalli, B. Fiala, S. Ols, M. Perotti, W. de van der Schueren, J. Snijder, E. Hodge, M. Benhaim, R. Ravichandran, L. Carter, et al. (2019) Induction of potent neutralizing antibody responses by a designed protein nanoparticle vaccine for respiratory syncytial virus. Cell 176 (6), pp. 1420–1431. Cited by: §6.
  • [35] S. Ovchinnikov, H. Kamisetty, and D. Baker (2014) Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. Elife 3, pp. e02030. Cited by: §1.
  • [36] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4052–4061. External Links: Link Cited by: §1.
  • [37] J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap (2020) Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [38] A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.), pp. 1177–1184. External Links: Link Cited by: §2.2, §2.3, §2.3, §2.3, §2.3, §C.
  • [39] A. Rives, S. Goyal, J. Meier, D. Guo, M. Ott, C. Zitnick, J. Ma, and R. Fergus (2019-04) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv. External Links: Document Cited by: §1.
  • [40] M. Rowland, J. Hron, Y. Tang, K. Choromanski, T. Sarlós, and A. Weller (2019) Orthogonal estimation of Wasserstein distances. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 186–195. External Links: Link Cited by: §2.4.
  • [41] A. Roy, M. Saffar, A. Vaswani, and D. Grangier (2020) Efficient content-based sparse attention with routing transformers. CoRR abs/2003.05997. External Links: Link, 2003.05997 Cited by: §1.
  • [42] E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. CoRR abs/1906.02243. External Links: Link, 1906.02243 Cited by: §6.
  • [43] Y. Tang, D. Nguyen, and D. Ha (2020) Neuroevolution of self-interpretable agents. CoRR abs/2003.08165. External Links: Link, 2003.08165 Cited by: §5, §6.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1, §2.1, §2.1.
  • [45] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §5, §6.
  • [46] O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2692–2700. Cited by: §1.
  • [47] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa (2009) Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences 106 (1), pp. 67–72. Cited by: §1.
  • [48] T. Xiao, Y. Li, J. Zhu, Z. Yu, and T. Liu (2019) Sharing attention weights for fast transformer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, S. Kraus (Ed.), pp. 5292–5298. External Links: Link, Document Cited by: §1.
  • [49] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy (2016) Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, K. Knight, A. Nenkova, and O. Rambow (Eds.), pp. 1480–1489. External Links: Link, Document Cited by: §5, §6.
  • [50] H. You, C. Li, P. Xu, Y. Fu, Y. Wang, X. Chen, R. G. Baraniuk, Z. Wang, and Y. Lin (2020) Drawing early-bird tickets: toward more efficient training of deep networks. In International Conference on Learning Representations, External Links: Link Cited by: §6.
  • [51] F. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar (2016) Orthogonal random features. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1975–1983. Cited by: §2.4, §2.4, §2.4.
  • [52] V. F. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. P. Reichert, T. P. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. W. Battaglia (2019) Deep reinforcement learning with relational inductive biases. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1.

APPENDIX: Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

A Extended computation costs

In this section, we empirically measure computational cost in terms of wall-clock time for both the forward and backward passes when we replace the attention mechanism in a regular Transformer-based architecture. Since some of the computational bottleneck in the Transformer may originate from the extra feed-forward layers [28], we focus on the attention part of our mechanism (which is primarily dependent on the sequence length L) by varying both the number of layers and the sequence length, while keeping the other components relatively minor, i.e. using a batch size of 1.

Figure 6: Using log-scale with time in seconds, we see that both the forward and backward passes scale nearly linearly with respect to the sequence length L, allowing for fast inference and training respectively. The linear regime begins approximately at the sequence length at which vanilla Transformers start to overload GPU memory.
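The measurement in Figure 6 can be reproduced in spirit with a rough benchmark (a sketch under simplifying assumptions: random stand-in feature maps, CPU numpy rather than the paper's accelerator setup; the helper name `time_attention` is ours):

```python
import time
import numpy as np

def time_attention(L, d=64, m=128, reps=3, seed=0):
    """Rough wall-clock comparison of quadratic softmax attention vs. a
    linear-time random-feature factorization (a benchmarking sketch, not
    the paper's exact setup)."""
    rng = np.random.default_rng(seed)
    Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
    # Stand-in (L, m) feature matrices playing the role of phi(Q), phi(K).
    Qp, Kp = (np.abs(rng.standard_normal((L, m))) for _ in range(2))

    def bench(fn):
        fn()                                   # warm-up run
        t0 = time.perf_counter()
        for _ in range(reps):
            fn()
        return (time.perf_counter() - t0) / reps

    # Quadratic path materializes the L x L attention matrix: O(L^2 d).
    quad = bench(lambda: np.exp(Q @ K.T / np.sqrt(d)) @ V)
    # Linear path never forms an L x L matrix: O(L m d) time, O(L m) space.
    lin = bench(lambda: Qp @ (Kp.T @ V))
    return quad, lin
```

On typical hardware, doubling L roughly quadruples the quadratic timing while the factored path grows roughly linearly, mirroring the regimes in Figure 6.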

B Extended approximation results

Although we mentioned previously (Sec. 4.2) that the Performer with additional finetuning is backwards compatible with the Transformer, we demonstrate below that error propagation through the non-attention components of the Transformer is one of the primary reasons that pretrained Transformer weights cannot be used immediately for inference on the corresponding Performer.

Figure 7: Output approximation errors between a vanilla Transformer and a Performer (with orthogonal features) for varying numbers of layers.
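The compounding effect behind Figure 7 can be illustrated with a toy simulation (not the paper's experiment; the residual-layer model, noise model, and the helper name `propagated_error` are all illustrative assumptions): a small per-layer approximation error injected into a deep residual network drifts further from the exact forward pass as depth grows.

```python
import numpy as np

def propagated_error(depth, eps=1e-2, d=64, seed=0):
    """Toy illustration of error propagation: inject a small relative
    error eps at each layer of a random residual network and measure the
    relative drift between exact and perturbed forward passes."""
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
    x_exact = x_approx = rng.standard_normal(d)
    for W in Ws:
        x_exact = x_exact + np.tanh(W @ x_exact)          # residual layer
        noisy = np.tanh(W @ x_approx)
        noisy = noisy * (1 + eps * rng.standard_normal(noisy.shape))
        x_approx = x_approx + noisy                       # eps-perturbed layer
    return np.linalg.norm(x_exact - x_approx) / np.linalg.norm(x_exact)
```

Even with a fixed per-layer error rate, the accumulated drift grows with depth, which is why small attention-approximation errors in early layers preclude directly reusing pretrained weights without finetuning.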

We further extend the hyperparameter sweep setting from Figure 4 in the main body of the paper, and see that across varying hyperparameters, training with orthogonal features is generally the most accurate.

Figure 8: Orthogonal vs. unstructured random features across various hyperparameter settings.
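The advantage of orthogonal features can be checked directly on the underlying kernel estimate (a numeric sketch of the orthogonality effect, not the paper's code; `rff_mse` and its defaults are our choices): for the Gaussian kernel, coupling the random directions to be orthogonal while keeping chi-distributed norms reduces estimator variance relative to i.i.d. sampling.

```python
import numpy as np

def rff_mse(d=32, m=32, trials=5000, orthogonal=False, seed=0):
    """Monte Carlo MSE of random Fourier features for the Gaussian kernel
    exp(-|x - y|^2 / 2), with i.i.d. vs. orthogonal direction sampling."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    y = np.zeros(d)
    y[0] = 1.0                                   # so |x - y| = 1
    exact = np.exp(-0.5)
    errs = np.empty(trials)
    for t in range(trials):
        G = rng.standard_normal((m, d))
        if orthogonal:
            # Orthogonalize the directions, then restore Gaussian-like
            # (chi-distributed) row norms so marginals match i.i.d. rows.
            Q, _ = np.linalg.qr(G.T)             # d x m orthonormal columns
            norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
            G = Q.T * norms[:, None]
        est = np.mean(np.cos(G @ (x - y)))       # unbiased RFF estimate
        errs[t] = (est - exact) ** 2
    return errs.mean()
```

With m on the order of d, the orthogonal estimator typically attains visibly lower MSE, consistent with the trends in Figure 8.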

C Theoretical results

We provide here the proof of Theorem 1 from the main body.


We consider first the case of the default FAVOR setting with the R-ORF mechanism turned on. We rely on Theorem 3 from [31], which applies in our case since for RBF kernels the corresponding random-feature function is a cosine (thus in particular it is bounded). Also, it is not hard to observe (see for instance the analysis in Claim 1 from [38]) that the required regularity conditions are satisfied. Using Theorem 3 from [31], we conclude that:


with any constant probability provided the number of random features m is sufficiently large, where the relevant scale is the diameter of the smallest ball containing all vectors of the form q − k for queries q and keys k. Since the queries and keys are bounded, this diameter is bounded as well, and one can take it to be a fixed constant. We have:


Choosing the number of random features m accordingly completes the proof. ∎
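As an empirical sanity check on the uniform-convergence flavor of Theorem 1 (a numeric sketch, not a proof; the helper `sup_rff_error` and its parameters are illustrative), the worst-case random-feature error over many point pairs inside a bounded ball shrinks as the number of features m grows:

```python
import numpy as np

def sup_rff_error(m, n_pts=30, d=8, diam=1.0, seed=0):
    """Max absolute error of the random-Fourier-feature estimate of the
    Gaussian kernel exp(-|x - y|^2 / 2) over sampled pairs in a ball."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-diam / 2, diam / 2, (n_pts, d))
    G = rng.standard_normal((m, d))
    diffs = X[:, None, :] - X[None, :, :]            # all pairwise x - y
    exact = np.exp(-np.sum(diffs ** 2, axis=-1) / 2) # Gaussian kernel values
    approx = np.mean(np.cos(diffs @ G.T), axis=-1)   # RFF estimates
    return np.abs(approx - exact).max()
```

The supremum error over the sampled pairs decays roughly like 1/sqrt(m), consistent with the m = Ω(1/ε²)-type feature counts the theorem requires for an ε-accurate approximation.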