Transformers [36, 68, 32] have become the state-of-the-art model on a wide range of applications across different data modalities, from language [20, 1, 15, 10, 54, 4, 7, 18] to images [21, 39, 70, 55, 51, 24], videos [3, 40], point clouds [87, 27], and protein sequences [57, 30]. In addition to their excellent performance on supervised learning tasks, transformers can also effectively transfer the learned knowledge from a pretraining task to new tasks with limited or no supervision [52, 53, 20, 85, 38]. At the core of transformers is the dot-product self-attention, which mainly accounts for the success of transformer models [11, 48, 37]. This dot-product self-attention learns self-alignment between tokens in an input sequence by estimating the relative importance of a given token with respect to all other tokens. It then transforms each token into a weighted average of the feature representations of other tokens, where the weight is proportional to an importance score between each pair of tokens. The importance scores in self-attention enable a token to attend to other tokens in the sequence, thus capturing the contextual representation [5, 75, 34].
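Given an input sequence $X := [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors, self-attention computes the output sequence $H$ from $X$ as follows:
Step 1: Projecting the input sequence into different subspaces. The input sequence $X$ is transformed into the query matrix $Q$, the key matrix $K$, and the value matrix $V$ via three linear transformations
$$Q = X W_Q^\top; \quad K = X W_K^\top; \quad V = X W_V^\top,$$
where $W_Q, W_K \in \mathbb{R}^{D \times D_x}$, and $W_V \in \mathbb{R}^{D_v \times D_x}$ are the weight matrices. We denote $Q := [q_1, \dots, q_N]^\top$, $K := [k_1, \dots, k_N]^\top$, and $V := [v_1, \dots, v_N]^\top$, where the vectors $q_i, k_i, v_i$ for $i = 1, \dots, N$ are the query, key, and value vectors, respectively.
Step 2: Computing the output as a weighted average. The output sequence $H := [h_1, \dots, h_N]^\top$ is then given by
$$H = \mathrm{softmax}\left(QK^\top/\sqrt{D}\right) V, \tag{1}$$
where the softmax function is applied to each row of the matrix $QK^\top/\sqrt{D}$. For each query vector $q_i$, $i = 1, \dots, N$, Eqn. (1) can be written in the vector form to compute the output vector $h_i$ as follows
$$h_i = \sum_{j=1}^{N} \mathrm{softmax}\left(q_i^\top k_j/\sqrt{D}\right) v_j. \tag{2}$$
The matrix $A := \mathrm{softmax}\left(QK^\top/\sqrt{D}\right) \in \mathbb{R}^{N \times N}$ and its component $a_{ij}$ for $i, j = 1, \dots, N$ are the attention matrix and attention scores, respectively. The self-attention computed by equations (1) and (2) is called the dot-product attention or softmax attention. In our paper, we refer to a transformer that uses this attention as the baseline transformer with the dot-product attention or the dot-product transformer. The structure of the attention matrix $A$ after training governs the ability of the self-attention to capture contextual representation for each token.
Multi-head Attention: Each output sequence $H$ forms an attention head. Multi-head attention concatenates multiple heads to compute the final output. Let $H$ be the number of heads and $W^O \in \mathbb{R}^{HD_v \times HD_v}$ be the projection matrix for the output. The multi-head attention is defined as
$$\mathrm{MultiHead}(\{H_i\}_{i=1}^{H}) = \mathrm{Concat}(H_1, \dots, H_H)\, W^O.$$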
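As a concrete illustration of the vector form of dot-product attention (each output is a softmax-weighted average of the value vectors), the following is a minimal pure-Python sketch for a single head. The function names and the toy inputs are illustrative assumptions, not taken from the paper, and the code is not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(Q, K, V):
    """Single-head attention: h_i = sum_j softmax(q_i . k_j / sqrt(D)) v_j.

    Q, K, V are lists of row vectors (lists of floats)."""
    D = len(Q[0])
    H = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in K]
        weights = softmax(scores)  # one row of the attention matrix
        H.append([sum(w * v[c] for w, v in zip(weights, V))
                  for c in range(len(V[0]))])
    return H

# Toy inputs (hypothetical values chosen only for illustration).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
H = dot_product_attention(Q, K, V)
```

Because every row of weights is nonnegative and sums to one, each output vector lies in the convex hull of the value vectors.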
The capacity of the attention mechanism and its ability to learn diverse syntactic and semantic relationships determine the success of transformers [69, 76, 14, 77, 28]. However, equations (1) and (2) imply that the dot-product attention assumes the features in $q_i$, as well as the features in $k_j$, are independent. Thus, the dot-product attention fails to capture the correlations between these features, limiting its representation capacity and inhibiting the performance of transformers on practical tasks where there is no guarantee that independent features can be learned from complex data. One solution to capture correlations between the features in $q_i$ and $k_j$ is to introduce covariance matrices into the formulation of the dot-product attention, at the cost of significantly increasing the computational complexity. Also, choosing good covariance matrices is difficult.
In this paper, we first establish a correspondence between self-attention and nonparametric kernel regression. Under this new perspective of self-attention, we explain the limitation of the dot-product self-attention that it may fail to capture correlations between the features in the query and key vectors. We then leverage the generalized Fourier integral theorems, which can automatically capture these correlations, and derive the generalized Fourier integral estimators for the nonparametric regression problem. Using this new density estimator, we propose the FourierFormer, a novel class of transformers that can capture correlations between features in the query and key vectors of self-attention. In summary, our contribution is three-fold:
We derive the formula of self-attention from solving a nonparametric kernel regression problem, thus providing a nonparametric regression interpretation to study and further develop self-attention.
We develop the generalized Fourier integral estimators for the nonparametric regression problem and provide theoretical guarantees for these estimators.
We propose the FourierFormer, whose attentions use the generalized Fourier integral estimators to capture correlations between features in the query and key vectors more efficiently.
Finally, we empirically show that the FourierFormer attains significantly better accuracy than the baseline transformer with the dot-product attention on a variety of tasks, including WikiText language modeling and ImageNet image classification. We also demonstrate in our experiments that FourierFormer helps reduce the redundancy between attention heads.
Organization We structure this paper as follows: In Section 2, we present the correspondence between self-attention and nonparametric kernel regression. In Section 3, we discuss the generalized Fourier integral estimators and define the FourierFormer. We validate and empirically analyze the advantages of FourierFormer in Section 4. We discuss related works in Section 5. The paper ends with concluding remarks. Technical proofs and more experimental details are provided in the Appendix.
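Notation For any $N \in \mathbb{N}$, we denote $[N] = \{1, 2, \dots, N\}$. For any $D \geq 1$, $L_1(\mathbb{R}^D)$ denotes the space of real-valued functions on $\mathbb{R}^D$ that are integrable. For any two sequences $\{a_N\}_{N \geq 1}, \{b_N\}_{N \geq 1}$, we denote $a_N = \mathcal{O}(b_N)$ to mean that $a_N \leq C b_N$ for all $N \geq 1$ where $C$ is some universal constant.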
2 A Nonparametric Regression Interpretation of Self-attention
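In this section, we establish the connection between self-attention and nonparametric kernel regression. In particular, we derive the self-attention in equation (2) as a nonparametric kernel regression in which the key vectors $k_j$ and value vectors $v_j$ are training inputs and training targets, respectively, while the query vectors $q_i$ and the output vectors $h_i$ form a set of new inputs and their corresponding targets that need to be estimated, respectively, for $i, j = 1, \dots, N$. In general, we can view the training set $\{k_j, v_j\}$ for $j \in [N]$ as coming from the following nonparametric regression model:
$$v_j = f(k_j) + \varepsilon_j, \tag{3}$$
where $\varepsilon_1, \dots, \varepsilon_N$ are independent noises such that $\mathbb{E}(\varepsilon_j) = 0$. Furthermore, we consider a random design setting where the key vectors $k_1, \dots, k_N$ are i.i.d. samples from the distribution that admits $p$ as density function. By an abuse of notation, we also denote by $p$ the joint density from which the key and value vectors $(v_1, k_1), \dots, (v_N, k_N)$ are i.i.d. samples. Here, $f$ is a true but unknown function and we would like to estimate it.
Nadaraya–Watson estimator: Our approach to estimating the function $f$ is based on Nadaraya–Watson's nonparametric kernel regression approach. In particular, from the nonparametric regression model (3), we have $\mathbb{E}[v_j \mid k_j] = f(k_j)$ for all $j \in [N]$. Therefore, it is sufficient to estimate the conditional distribution of the value vectors given the key vectors. Given the density function $p$ of the key vectors and the joint density $p$ of the key and value vectors, for any pair of vectors $(v, k)$ generated from model (3) we have
$$\mathbb{E}[v \mid k] = \int_{\mathbb{R}^D} v \cdot p(v \mid k)\, dv = \int v \cdot \frac{p(v, k)}{p(k)}\, dv. \tag{4}$$
The formulation (4) of the conditional expectation indicates that as long as we can estimate the joint density function $p(v, k)$ and the marginal density function $p(k)$, we are able to obtain an estimate of the conditional expectation and thus of the function $f$. This approach is widely known as Nadaraya–Watson's nonparametric kernel regression approach.
Kernel density estimator: To estimate $p(v, k)$ and $p(k)$, we employ the kernel density estimation approach [58, 49]. In particular, by using the isotropic Gaussian kernel with bandwidth $\sigma$, we have the following estimators of $p(v, k)$ and $p(k)$:
$$\hat{p}_\sigma(v, k) = \frac{1}{N} \sum_{j=1}^{N} \varphi_\sigma(v - v_j)\, \varphi_\sigma(k - k_j), \qquad \hat{p}_\sigma(k) = \frac{1}{N} \sum_{j=1}^{N} \varphi_\sigma(k - k_j), \tag{5}$$
where $\varphi_\sigma(\cdot)$ is the isotropic multivariate Gaussian density function with diagonal covariance matrix $\sigma^2 I_D$. Given the kernel density estimators (5), we obtain the following estimate of the function $f$:
$$\hat{f}_\sigma(k) = \int v \cdot \frac{\hat{p}_\sigma(v, k)}{\hat{p}_\sigma(k)}\, dv = \frac{\sum_{j=1}^{N} \varphi_\sigma(k - k_j) \int v \cdot \varphi_\sigma(v - v_j)\, dv}{\sum_{j=1}^{N} \varphi_\sigma(k - k_j)} = \frac{\sum_{j=1}^{N} v_j\, \varphi_\sigma(k - k_j)}{\sum_{j=1}^{N} \varphi_\sigma(k - k_j)}. \tag{6}$$
Connection between Self-Attention and nonparametric regression: By plugging the query vectors $q_i$ into the function $\hat{f}_\sigma$ in equation (6), we obtain that
$$\hat{f}_\sigma(q_i) = \frac{\sum_{j=1}^{N} v_j \exp\left(-\|q_i - k_j\|^2 / 2\sigma^2\right)}{\sum_{j=1}^{N} \exp\left(-\|q_i - k_j\|^2 / 2\sigma^2\right)} = \frac{\sum_{j=1}^{N} v_j \exp\left[-\left(\|q_i\|^2 + \|k_j\|^2\right) / 2\sigma^2\right] \exp\left(q_i^\top k_j / \sigma^2\right)}{\sum_{j=1}^{N} \exp\left[-\left(\|q_i\|^2 + \|k_j\|^2\right) / 2\sigma^2\right] \exp\left(q_i^\top k_j / \sigma^2\right)}. \tag{7}$$
If we further assume that the keys $k_j$ are normalized, the factors depending on $\|q_i\|^2$ and $\|k_j\|^2$ cancel between the numerator and the denominator, and the value of $\hat{f}_\sigma(q_i)$ in equation (7) becomes
$$\hat{f}_\sigma(q_i) = \frac{\sum_{j=1}^{N} \exp\left(q_i^\top k_j / \sigma^2\right) v_j}{\sum_{j=1}^{N} \exp\left(q_i^\top k_j / \sigma^2\right)} = \sum_{j=1}^{N} \mathrm{softmax}\left(q_i^\top k_j / \sigma^2\right) v_j. \tag{8}$$
When we choose $\sigma^2 = \sqrt{D}$ where $D$ is the dimension of $q_i$ and $k_j$, equation (8) matches equation (2) of self-attention, namely, $\hat{f}_\sigma(q_i) = h_i$. Thus, we have shown that self-attention performs nonparametric regression using isotropic Gaussian kernels.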
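The equivalence between Gaussian-kernel Nadaraya–Watson regression and softmax attention for normalized keys can be checked numerically. The sketch below is a minimal pure-Python illustration; the function names, the unit-norm toy keys, and the choice of bandwidth equal to the square root of the feature dimension are assumptions made for the example.

```python
import math

def nadaraya_watson_gaussian(q, keys, values, sigma2):
    """Gaussian-kernel Nadaraya-Watson estimate at query q.

    Kernel weights are exp(-||q - k_j||^2 / (2 sigma^2)); the Gaussian
    normalizing constants cancel in the ratio and are omitted."""
    w = [math.exp(-sum((a - b) ** 2 for a, b in zip(q, k)) / (2.0 * sigma2))
         for k in keys]
    total = sum(w)
    return [sum(wj * v[c] for wj, v in zip(w, values)) / total
            for c in range(len(values[0]))]

def softmax_attention(q, keys, values, sigma2):
    """Softmax(q . k_j / sigma^2)-weighted average of the values."""
    s = [sum(a * b for a, b in zip(q, k)) / sigma2 for k in keys]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    total = sum(w)
    return [sum(wj * v[c] for wj, v in zip(w, values)) / total
            for c in range(len(values[0]))]

# Hypothetical toy data: every key has unit Euclidean norm.
keys = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
values = [[1.0, -1.0], [2.0, 0.5], [0.0, 3.0]]
q = [0.3, -1.2]
sigma2 = math.sqrt(len(q))  # sigma^2 = sqrt(D)

nw = nadaraya_watson_gaussian(q, keys, values, sigma2)
sm = softmax_attention(q, keys, values, sigma2)
```

The two outputs agree up to floating-point error because the key norms are identical. If the keys had different norms, the two estimators would generally differ.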
The assumption that $k_j$ is normalized is made to recover the pairwise dot-product attention in transformers. In general, this assumption is not necessary. In fact, the isotropic Gaussian kernel in equation (7) is more desirable than the dot-product kernel in equation (8) of the pairwise dot-product attention since the former is Lipschitz while the latter is not. The Lipschitz constraint helps improve the robustness of the model [13, 73, 2] and stabilize the model training.
Limitation of Self-Attention: From our nonparametric regression interpretation, self-attention is derived from the use of isotropic Gaussian kernels for kernel density estimation and nonparametric regression estimation, which may fail to capture the complex correlations between features in $q_i$ and $k_j$ [80, 29]. Using multivariate Gaussian kernels with dense covariance matrices can help capture such correlations; however, choosing good covariance matrices is challenging and inefficient [79, 65, 9]. In the following section, we discuss the Fourier integral estimator and its use as a kernel for computing self-attention in order to overcome these limitations.
3 FourierFormer: Transformer via Generalized Fourier Integral Theorem
In the following, we introduce generalized integral theorems that are able to capture the complex interactions among the features of the queries and keys. We then apply these theorems to density estimation and nonparametric regression problems. We also establish the convergence rates of these estimators. Given these density estimators, we introduce a novel family of transformers, named FourierFormer, that integrates the generalized Fourier integral theorem into the dot-product attention step of the standard transformer.
3.1 (Generalized) Fourier Integral Theorems and Their Applications
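The Fourier integral theorem is a combination of the Fourier transform and the inverse Fourier transform. In particular, for any function $p \in L_1(\mathbb{R}^D)$, the Fourier integral theorem is given by
$$p(k) = \frac{1}{(2\pi)^D} \int_{\mathbb{R}^D} \int_{\mathbb{R}^D} \cos\left(s^\top (k - y)\right) p(y)\, dy\, ds = \frac{1}{\pi^D} \lim_{R \to \infty} \int_{\mathbb{R}^D} \prod_{j=1}^{D} \frac{\sin\left(R(k_j - y_j)\right)}{k_j - y_j}\, p(y)\, dy, \tag{9}$$
where $k = (k_1, \dots, k_D)$ and $y = (y_1, \dots, y_D)$. Equation (9) suggests that
$$p_R(k) := \frac{1}{\pi^D} \int_{\mathbb{R}^D} \prod_{j=1}^{D} \frac{\sin\left(R(k_j - y_j)\right)}{k_j - y_j}\, p(y)\, dy$$
can be used as an estimator of the function $p$.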
Benefits of the Fourier integral over Gaussian kernel: There are two important benefits of the estimator $p_R$: (i) It can automatically preserve the correlated structure lying within $p$ even when $p$ is a very complex and high-dimensional function. This is in stark contrast to the standard kernel estimator built on the multivariate Gaussian kernel, where we need to choose a good covariance matrix for the estimator to work well. We note that, as the standard softmax transformer is constructed from the multivariate Gaussian kernel, the issue of choosing a good covariance matrix in the dot-product transformer is inevitable. (ii) The product of sinc kernels in the estimator $p_R$ does not decay to a point mass when $R \to \infty$. This is in stark contrast to the multivariate Gaussian kernel estimator, which converges to a point mass when the covariance matrix goes to 0. It indicates that $p_R$ is a non-trivial estimator of the function $p$. Finally, detailed illustrations of these benefits of the Fourier integral over the Gaussian kernel in density estimation and nonparametric regression problems, which we have just shown to be connected to the self-attention in transformers, can be found in Section 8 in .
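Generalized Fourier integral estimator: Building on the above benefits of the Fourier integral estimator $p_R$, in this paper we consider a generalization of that estimator, named the generalized Fourier integral estimator, which is given by:
$$p_R^\phi(k) := \frac{R^D}{A^D} \int_{\mathbb{R}^D} \prod_{j=1}^{D} \phi\left(\frac{\sin\left(R(k_j - y_j)\right)}{R(k_j - y_j)}\right) p(y)\, dy, \tag{10}$$
where $A := \int_{\mathbb{R}} \phi\left(\frac{\sin(z)}{z}\right) dz$ and $\phi: \mathbb{R} \to \mathbb{R}$ is a given function. When $\phi(k) = k$ for all $k \in \mathbb{R}$, the generalized Fourier integral estimator $p_R^\phi$ becomes the Fourier integral estimator $p_R$. Under appropriate conditions on the function $\phi$ (see Theorem 1 in Section 3.1.1 and Theorem 3 in Appendix A.1), the estimator $p_R^\phi$ converges to the true function $p$, namely,
$$p(k) = \lim_{R \to \infty} p_R^\phi(k) = \lim_{R \to \infty} \frac{R^D}{A^D} \int_{\mathbb{R}^D} \prod_{j=1}^{D} \phi\left(\frac{\sin\left(R(k_j - y_j)\right)}{R(k_j - y_j)}\right) p(y)\, dy. \tag{11}$$
We name the above limit the generalized Fourier integral theorem. Furthermore, the estimator $p_R^\phi$ also inherits the aforementioned benefits of the Fourier integral estimator $p_R$. Therefore, we will use the generalized Fourier integral theorem as a building block for constructing density estimators and nonparametric regression estimators, which are crucial to developing the FourierFormer in Section 3.2.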
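The generalized estimator is normalized by the integral of the chosen function applied to sin(z)/z over the real line. As a sanity check, the sketch below approximates that constant numerically for three choices of the function: the identity (which should give pi, the Dirichlet integral, recovering the original Fourier integral estimator), the square (also pi), and the fourth power (2*pi/3). The helper names, the truncation window, and the step size are assumptions of this illustration.

```python
import math

def sinc(z):
    """sin(z)/z with the removable singularity at z = 0 filled in."""
    return 1.0 if z == 0.0 else math.sin(z) / z

def normalizer(phi, half_width=1000.0, step=0.01):
    """Midpoint-rule approximation of the integral of phi(sin(z)/z),
    truncated to [-half_width, half_width]."""
    n = round(2 * half_width / step)
    return step * sum(phi(sinc(-half_width + (i + 0.5) * step))
                      for i in range(n))

A_identity = normalizer(lambda x: x)      # approximately pi
A_square = normalizer(lambda x: x ** 2)   # approximately pi
A_fourth = normalizer(lambda x: x ** 4)   # approximately 2*pi/3
```

The truncation error is largest for the identity, whose integrand decays only like 1/z, so the window is deliberately wide. Even powers decay faster and are also nonnegative.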
3.1.1 Density Estimation via Generalized Fourier Integral Theorems
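We first apply the generalized Fourier integral theorem to the density estimation problem. To ease the presentation, we assume that $k_1, k_2, \dots, k_N \in \mathbb{R}^D$ are i.i.d. samples from a distribution admitting density function $p$, where $D \geq 1$ is the dimension. Inspired by the generalized Fourier integral theorem, we obtain the following generalized Fourier density estimator $p_{N,R}^\phi$ of $p$:
$$p_{N,R}^\phi(k) := \frac{R^D}{N A^D} \sum_{i=1}^{N} \prod_{j=1}^{D} \phi\left(\frac{\sin\left(R(k_j - k_{ij})\right)}{R(k_j - k_{ij})}\right), \tag{12}$$
where $A = \int_{\mathbb{R}} \phi\left(\frac{\sin(z)}{z}\right) dz$ and $k_i = (k_{i1}, \dots, k_{iD})$ for all $i \in [N]$. To quantify the error between the generalized Fourier density estimator $p_{N,R}^\phi$ and the true density $p$, we utilize the mean integrated squared error (MISE), which is given by:
$$\mathrm{MISE}\left(p_{N,R}^\phi, p\right) := \int_{\mathbb{R}^D} \mathbb{E}\left[\left(p_{N,R}^\phi(k) - p(k)\right)^2\right] dk. \tag{13}$$
We start with the following bound on the MISE between $p_{N,R}^\phi$ and $p$.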
Assume that for all and for some . Then, there exist universal constants C and C’ depending on and such that
First, by choosing the radius to balance the bias and variance in the bound of the MISE in Theorem 1, we have the optimal as . With that choice of , the MISE rate of is . Second, when for and , the assumptions in Theorem 1 are satisfied when . In this case, the MISE rate of is . However, these assumptions are not satisfied when and , which is due to the limitation of the current proof technique of Theorem 1 that is based on Taylor expansion of the estimator .
To address the limitation of the Taylor expansion technique, we utilize the Plancherel theorem in Fourier analysis to establish the MISE rate of when and . The details of the theoretical analyses for such setting are in Appendix A.
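To make the generalized Fourier density estimator of this subsection concrete, the following pure-Python sketch evaluates it on one-dimensional standard normal samples. The fourth-power choice of the function, the radius 3.0, the sample size, and the seed are assumptions made for the illustration; the normalizing constant 2*pi/3 is the known value of the integral of (sin z / z)^4 over the real line.

```python
import math
import random

def sinc(z):
    """sin(z)/z with the removable singularity at z = 0 filled in."""
    return 1.0 if z == 0.0 else math.sin(z) / z

def fourier_density(k, samples, R, phi, A):
    """Generalized Fourier density estimate at the point k:
    R^D / (N A^D) * sum_i prod_j phi(sinc(R (k_j - k_ij)))."""
    D = len(k)
    total = 0.0
    for s in samples:
        prod = 1.0
        for kj, sj in zip(k, s):
            prod *= phi(sinc(R * (kj - sj)))
        total += prod
    return (R / A) ** D * total / len(samples)

random.seed(0)
samples = [[random.gauss(0.0, 1.0)] for _ in range(2000)]
A = 2.0 * math.pi / 3.0  # integral of (sin z / z)^4 over the real line
estimate = fourier_density([0.0], samples, R=3.0, phi=lambda x: x ** 4, A=A)
true_value = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at 0
```

With a finite radius the estimate is a smoothed version of the density, so it is expected to sit slightly below the true peak value. Larger radii reduce this bias at the price of higher variance.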
3.2 FourierFormer: Transformers with Fourier Attentions
Motivated by the preservation of the correlated structure of the function from the generalized Fourier integral theorem as well as the theoretical guarantees of density estimators, in this section we adapt the nonparametric regression interpretation of self-attention in Section 2 and propose the generalized Fourier nonparametric regression estimator in Section 3.2.1. We also establish the convergence properties of that estimator. Then, based on generalized Fourier nonparametric regression estimator, we develop the Fourier Attention and its corresponding FourierFormer in Section 3.2.2.
3.2.1 Nonparametric Regression via Generalized Fourier Integral Theorem
We now discuss an application of the generalized Fourier integral theorems to the nonparametric regression setting (3), namely, we assume that $(v_1, k_1), \dots, (v_N, k_N)$ are i.i.d. samples from the following nonparametric regression model:
$$v_j = f(k_j) + \varepsilon_j,$$
where $\varepsilon_1, \dots, \varepsilon_N$ are independent noises such that $\mathbb{E}(\varepsilon_j) = 0$ and the key vectors $k_1, k_2, \dots, k_N$ are i.i.d. samples from $p$. Given the generalized Fourier density estimator (12), following the argument in Section 2, the Nadaraya–Watson estimator of the function $f$ based on the generalized Fourier density estimator is given by:
$$f_{N,R}(k) := \frac{\sum_{i=1}^{N} v_i \prod_{j=1}^{D} \phi\left(\frac{\sin\left(R(k_j - k_{ij})\right)}{R(k_j - k_{ij})}\right)}{\sum_{i=1}^{N} \prod_{j=1}^{D} \phi\left(\frac{\sin\left(R(k_j - k_{ij})\right)}{R(k_j - k_{ij})}\right)}, \tag{14}$$
where $k = (k_1, \dots, k_D)$ and $k_i = (k_{i1}, \dots, k_{iD})$. The main difference between the generalized Fourier nonparametric regression estimator $f_{N,R}$ in equation (14) and the estimator $\hat{f}_\sigma$ in equation (6) is that $f_{N,R}$ uses the generalized Fourier density estimator to estimate the conditional distribution of the value vectors given the key vectors, instead of the isotropic Gaussian kernel density estimator used in $\hat{f}_\sigma$. As we highlighted in Section 3, an important benefit of the generalized Fourier density estimator is that it can capture the complex dependencies of the features of the value vectors and the key vectors, while the Gaussian kernel needs a good covariance matrix to do that, which is computationally expensive in practice.
We now have the following result establishing the mean square error (MSE) of .
Assume that for all and for any for some . Then, for any , there exist universal constants such that the following holds:
where . Here, the outer expectation is taken with respect to the key vectors and the noises .
The proof of Theorem 2 is in Appendix B.3. A few comments on Theorem 2 are in order. First, by choosing to balance the bias and variance in the bound of the MSE of the nonparametric generalized Fourier estimator , we have the optimal radius as . With that choice of the optimal radius , the rate of is . Second, when for , the assumption on the function of Theorem 2 is satisfied with . In this case, the rate of becomes . In Appendix A, we also provide the rate of when for some , which includes the original Fourier integral theorem.
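Given the generalized Fourier nonparametric regression estimator $f_{N,R}$ in equation (14), by plugging the query values $q_1, \dots, q_N$ into that function, we obtain the following definition of the Fourier attention:
Definition 1 (Fourier Attention).
A Fourier attention is a multi-head attention that does nonparametric regression using the generalized Fourier nonparametric regression estimator $f_{N,R}$. The output $\hat{h}_i$ of the Fourier attention is then computed as
$$\hat{h}_i := f_{N,R}(q_i) = \frac{\sum_{j=1}^{N} v_j \prod_{l=1}^{D} \phi\left(\frac{\sin\left(R(q_{il} - k_{jl})\right)}{R(q_{il} - k_{jl})}\right)}{\sum_{j=1}^{N} \prod_{l=1}^{D} \phi\left(\frac{\sin\left(R(q_{il} - k_{jl})\right)}{R(q_{il} - k_{jl})}\right)} \quad \forall\, i \in [N].$$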
Given the Fourier Attention in Definition 1, we then give the definition of FourierFormer as follows.
Definition 2 (FourierFormer).
A FourierFormer is a transformer that uses Fourier attention to capture dependency between tokens in the input sequence and the correlation between features in each token.
Remark 2 (The Nonnegativity of the Fourier Kernel).
The density estimation via the generalized Fourier integral theorem in Section 3.1.1 does not require the generalized Fourier density estimator to be nonnegative. However, empirically, we observe that a negative density estimator can cause instability in training the FourierFormer. Thus, in FourierFormer, we choose the function $\phi$ to be a nonnegative function to enforce the density estimator to be nonnegative. In particular, we choose $\phi$ to be power functions of the form , where is a positive integer. Note that when and , the kernels in our generalized Fourier integral estimators are the well-known Fejer-de la Vallee Poussin and Jackson-de la Vallee Poussin kernels.
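The Fourier attention of Definition 1, combined with a nonnegative function as discussed in Remark 2, can be sketched for a single head as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the fixed radius, the toy inputs, and the specific fourth-power choice are assumptions.

```python
import math

def sinc(z):
    """sin(z)/z with the removable singularity at z = 0 filled in."""
    return 1.0 if z == 0.0 else math.sin(z) / z

def fourier_attention(Q, K, V, R, phi):
    """Single-head Fourier attention.

    Weight between query i and key j: prod_l phi(sinc(R (q_il - k_jl))).
    Output i: weighted average of the value vectors with these weights."""
    H = []
    for q in Q:
        w = []
        for k in K:
            prod = 1.0
            for ql, kl in zip(q, k):
                prod *= phi(sinc(R * (ql - kl)))
            w.append(prod)
        total = sum(w)  # assumed nonzero; true here because each query equals a key
        H.append([sum(wj * v[c] for wj, v in zip(w, V)) / total
                  for c in range(len(V[0]))])
    return H

def phi(x):
    # Assumed nonnegative even power; the exponent is an illustrative choice.
    return x ** 4

# Hypothetical toy inputs.
Q = [[0.0, 0.0], [1.0, 1.0]]
K = [[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
H = fourier_attention(Q, K, V, R=1.0, phi=phi)
single = fourier_attention([[0.3]], [[0.3]], [[5.0]], R=2.0, phi=phi)
```

With a nonnegative function the weights are nonnegative, so each output stays inside the convex hull of the values, mirroring softmax attention. With the identity function the weights can be negative, which is the instability Remark 2 refers to.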
3.3 An Efficient Implementation of the Fourier Attention
The Fourier kernel is implemented efficiently using the C++/CUDA extension mechanism provided by PyTorch. The idea is similar to the function cdist, which computes the p-norm distance between each pair of the two collections of row vectors. In our case, we aim to compute the kernel functions that represent a Fourier attention in Definition 1. The core of this implementation is the following Fourier metric function $d_f$:
$$d_f(q_i, k_j) = \prod_{l=1}^{D} \phi\left(\frac{\sin\left(R(q_{il} - k_{jl})\right)}{R(q_{il} - k_{jl})}\right).$$
We directly implement $d_f$ as a torch.autograd.Function in which we provide an efficient way to compute the forward and backward functions ($d_f$ and the gradient of $d_f$). While the implementation of the forward function is straightforward, the backward function is trickier since we need to optimize the code to compute the gradient of $d_f$ w.r.t. the variables $q$, $k$, and $R$ all at once. We develop the backward function with highly parallel computation by exploiting the GPU architecture and utilizing the reduction technique. The computational time is comparable to that of the function cdist; thus, our FourierFormer implementation is computationally time-efficient.
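The backward pass described above needs the gradient of the Fourier metric with respect to its inputs. As a CPU-only illustration of what such a backward function must compute (this is not the authors' CUDA kernel), the sketch below derives the gradient with respect to the radius via the product rule and checks it against a central finite difference. The even-power form of the function, the helper names, and the toy vectors are assumptions.

```python
import math

def sinc(z):
    """sin(z)/z with the removable singularity at z = 0 filled in."""
    return 1.0 if z == 0.0 else math.sin(z) / z

def dsinc(z):
    """Derivative of sin(z)/z with respect to z; equals 0 at z = 0."""
    return 0.0 if z == 0.0 else (z * math.cos(z) - math.sin(z)) / (z * z)

def fourier_metric(q, k, R, m=2):
    """prod_l sinc(R (q_l - k_l))^(2m), i.e. phi(x) = x^(2m) (assumed form)."""
    out = 1.0
    for ql, kl in zip(q, k):
        out *= sinc(R * (ql - kl)) ** (2 * m)
    return out

def grad_R(q, k, R, m=2):
    """Analytic derivative of fourier_metric with respect to R (product rule)."""
    diffs = [ql - kl for ql, kl in zip(q, k)]
    s = [sinc(R * d) for d in diffs]
    total = 0.0
    for l, d in enumerate(diffs):
        others = 1.0
        for j, sj in enumerate(s):
            if j != l:
                others *= sj ** (2 * m)
        # d/dR of sinc(R d)^(2m) = 2m * sinc^(2m-1) * sinc'(R d) * d
        total += 2 * m * s[l] ** (2 * m - 1) * dsinc(R * d) * d * others
    return total

# Hypothetical toy inputs.
q = [0.2, -0.4, 1.0]
k = [-0.1, 0.3, 0.5]
R, h = 1.5, 1e-6
analytic = grad_R(q, k, R)
numeric = (fourier_metric(q, k, R + h) - fourier_metric(q, k, R - h)) / (2 * h)
```

The gradients with respect to the query and key coordinates follow the same product-rule pattern, with the factor `d` replaced by `R` (and a sign flip for the key).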
4 Experimental Results
We evaluate FourierFormer against the baseline dot-product transformer on two tasks: language modeling on WikiText-103 (Section 4.1) and image classification on ImageNet [19, 59] (Section 4.2). We aim to show that: (i) FourierFormer achieves better accuracy than the baseline transformer on a variety of practical tasks with different data modalities, and (ii) FourierFormer helps reduce head redundancy compared to the baseline transformer (Section 4.3).
Throughout the section, we compare FourierFormers with the baseline dot-product transformers of the same configuration. In all experiments, we make the constant $R$ in Fourier attention (see equation (54)) a learnable scalar and choose the function $\phi$ as described in Remark 2. All of our results are averaged over 5 runs with different seeds. More details on the models and training are provided in Appendix C. We also provide additional experimental results in Appendix D.
4.1 Language Modeling on WikiText-103
Datasets and metrics WikiText-103 is a collection of articles from Wikipedia, which have long contextual dependencies. The training set consists of about articles containing running words; this corresponds to text blocks of about 3600 words. The validation and test sets have and running words, respectively. Each of them contains articles and about words. Our experiment follows the standard setting [42, 63] and splits the training data into -word independent long segments. For evaluation, we use a batch size of 1, and process the text sequence with a sliding window of size . The last position is used for computing perplexity (PPL) except in the first segment, where all positions are evaluated as in [1, 63].
Models and baselines: Our implementation is based on the public code available at https://github.com/IDSIA/lmtool-fwp. We use their small and medium models in our experiments. In particular, for small models, the key, value, and query dimensions are set to 128, and the training and evaluation context lengths are set to 256. For medium models, the key, value, and query dimensions are set to 256, and the training and evaluation context lengths are set to 384. In both configurations, the number of heads is 8, the feed-forward layer dimension is 2048, and the number of layers is 16.
Results: We report the validation and test perplexity (PPL) of FourierFormer versus the baseline transformer with the dot-product attention in Table 1. FourierFormers attain much better PPL than the baselines in both small and medium configurations. For the small configuration, the improvements of FourierFormer over the baseline are 1.29 PPL in validation and 1.44 PPL in test. For the medium configuration, these improvements are 1.39 PPL in validation and 1.59 PPL in test. These results suggest that the advantage of FourierFormer over the baseline dot-product transformer grows with the model's size. This meets our expectation because larger models have larger query and key dimensions, e.g., the language model with the medium configuration in this experiment has a query and key dimension of 256, versus 128 in the language model with the small configuration. Since the advantage of FourierFormer results from its ability to capture correlations between features in the query and key vectors, the larger the query and key dimensions are, the more advantage FourierFormer has.
|Method|Valid PPL|Test PPL|
|Baseline dot-product (small)|33.15|34.29|
|Baseline dot-product (medium)|27.90|29.60|
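The table rows above list only the baselines, but the FourierFormer perplexities can be inferred from the improvements quoted in the results paragraph. The short Python sketch below performs that subtraction. The derived numbers are an inference from the stated baselines and improvements, not values read from the table, and the variable names are illustrative.

```python
# Baseline (valid PPL, test PPL) from the table, and the PPL improvements
# quoted in the results paragraph.
baseline = {"small": (33.15, 34.29), "medium": (27.90, 29.60)}
improvement = {"small": (1.29, 1.44), "medium": (1.39, 1.59)}

# Implied FourierFormer perplexity = baseline - improvement (lower is better).
implied = {
    size: tuple(round(b - d, 2)
                for b, d in zip(baseline[size], improvement[size]))
    for size in baseline
}

for size, (valid, test) in implied.items():
    print(f"FourierFormer ({size}): valid {valid:.2f}, test {test:.2f}")
```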
4.2 Image Classification on ImageNet
Datasets and metrics The ImageNet dataset [19, 59] consists of training images and validation images. For this benchmark, the model learns to predict the category of the input image among 1000 categories. Top-1 and top-5 classification accuracies are reported.
Models and baselines: We use the DeiT-tiny model with 12 transformer layers, 4 attention heads per layer, and a model dimension of 192. To train the models, we follow the same setting and configuration as for the baseline, whose implementation is available at https://github.com/facebookresearch/deit.
Results: We summarize our results in Table 2. As in the language modeling experiment, for this image classification task, the DeiT model equipped with FourierFormer significantly outperforms the baseline DeiT dot-product transformer in both top-1 and top-5 accuracy. This result suggests that the advantage of FourierFormer over the baseline dot-product transformer holds across different data modalities.
|Method||Top-1 Acc||Top-5 Acc|
4.3 FourierFormer Helps Reduce Head Redundancy
To study the diversity between attention heads, given the model trained for the WikiText-103 language modeling task, we compute the average distance between heads in each layer. We show the layer-average mean and variance of distances between heads in Table 3. Results in Table 3 show that FourierFormer obtains greater distance between attention heads than the baseline transformer with the dot-product attention and thus helps reduce the head redundancy. Note that we use the small configuration as specified in Section 4.1 for both models.
Table 3: Layer-average mean and standard deviation of distances between heads of FourierFormer versus the baseline transformer with dot-product attention trained for the WikiText-103 language modeling task. FourierFormer has greater distance between heads than the baseline and thus captures more diverse attention patterns.
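One way to compute an average distance between heads in a layer is to take the mean pairwise distance between the heads' attention matrices. The pure-Python sketch below does this with the Euclidean (Frobenius) distance. Both the choice of distance and the decision to compare attention matrices rather than head outputs are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_head_distance(heads):
    """Average Euclidean (Frobenius) distance over all pairs of heads.

    `heads` is a list of attention matrices, each a list of rows."""
    flat = [[a for row in h for a in row] for h in heads]
    dists = [math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
             for u, v in combinations(flat, 2)]
    return sum(dists) / len(dists)

# Hypothetical toy layers with two heads each.
identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
redundant_layer = mean_pairwise_head_distance([identity, identity])
diverse_layer = mean_pairwise_head_distance([identity, swapped])
```

A layer whose heads attend identically scores zero, and larger values indicate more diverse attention patterns. Averaging this quantity over layers gives a layer-average figure of the kind reported in Table 3.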
5 Related Work
Interpretation of Attention Mechanism in Transformers: Recent works have tried to gain an understanding of the transformer's attention from different perspectives. One line of work considers attention as applying a kernel smoother over the inputs. Extending this kernel approach, [31, 12, 81] linearize the softmax kernel in dot-product attention and propose a family of efficient transformers with linear computational and memory complexity. A subsequent work then shows that these linear transformers are comparable to a Petrov-Galerkin projection, suggesting that the softmax normalization in the dot-product attention is sufficient but not necessary. Other works that provide an understanding of attention in transformers via ordinary/partial differential equations include [41, 61]. In addition, [67, 26, 86, 47] relate attentions in transformers to Gaussian mixture models. Several works also connect the attention mechanism to graph-structured learning and message passing in graphical models [82, 64, 35]. Our work focuses on deriving the connection between self-attention and nonparametric kernel regression and exploring better regression estimators, such as the generalized Fourier nonparametric regression estimator, to improve the performance of transformers.
Prior works show that neurons and attention heads in pre-trained transformers are redundant and can be removed when applied to a downstream task. By studying the contextualized embeddings in pre-trained networks, it has been demonstrated that the learned representations from these redundant models are highly anisotropic [45, 23]. Furthermore, [62, 66, 78, 60] employ knowledge distillation and sparse approximation to enhance the efficiency of transformers. Our FourierFormer is complementary to these methods and can be combined with them.
6 Concluding Remarks
In this paper, we establish the correspondence between nonparametric kernel regression and the self-attention in transformers. We then develop the generalized Fourier integral estimators and propose the FourierFormer, a novel class of transformers that use the generalized Fourier integral estimators to construct their attentions for efficiently capturing the correlations between features in the query and key vectors. We theoretically prove the approximation guarantees of the generalized Fourier integral estimators and empirically validate the advantage of FourierFormer over the baseline transformer with the dot-product attention in terms of accuracy and head redundancy reduction. It would be interesting to incorporate robust kernels into the nonparametric regression framework of FourierFormer to enhance the robustness of the model under data perturbation and adversarial attacks. A limitation of FourierFormer is that it still has the same quadratic computational and memory complexity as the baseline transformer with the dot-product attention. We leave the development of a linear version of FourierFormer that achieves linear computational and memory complexity as future work. It is worth noting that there are no potential negative societal impacts of FourierFormer.
Supplement to “FourierFormer: Transformer Meets Generalized Fourier Integral Theorem”
In the supplementary material, we collect proofs, additional theories, and experiment results deferred from the main text. In Appendix A, we provide additional theoretical results for generalized Fourier density estimator and for generalized Fourier nonparametric regression estimator. We provide proofs of key results in the main text and additional theories in Appendix B. We present experiment details in Appendix C while including additional experimental results in Appendix D.
Appendix A Additional Theoretical Results
A.1 Generalized Fourier density estimator
We now establish the MISE rate of in equation (12) when and . We consider the following tail bounds on the Fourier transform of the true density function as follows.
(1) We say that is supersmooth of order if we have universal constants and such that the following inequalities hold for almost surely :
Here, denotes the Fourier transform of the function .
(2) The function is ordinary smooth of order if there exists universal constant such that the following inequality holds for almost surely :
The notions of supersmoothness and ordinary smoothness have been widely used in deconvolution problems and density estimation problems [17, 74, 29]. The supersmooth condition is satisfied when $p$ is the density of a Gaussian or Cauchy distribution, while the ordinary smooth condition is satisfied when $p$ is the density of a Laplace or Beta distribution.
Based on the smoothness conditions in Definition 3, we have the following result regarding the mean integrated squared error (MISE) of the generalized Fourier density estimator (12) (see equation (13) for a definition of MISE) when and .
(a) When , the following holds:
(Supersmooth setting) If the true density function is a supersmooth function of order for some , then there exist universal constants and such that as long as we have
(Ordinary smooth setting) If the true density function is an ordinary smooth function of order for some , then there exist universal constants such that
(b) When , the following holds:
(Supersmooth setting) If the true density function is a supersmooth function of order for some , then there exist universal constants and such that as long as we have
(Ordinary smooth setting) If the true density function is an ordinary smooth function of order for some , then there exist universal constants such that
When : As part (a) of Theorem 3 indicates, when the function is supersmooth, by choosing the radius to balance the bias and variance, we have the optimal as and the MISE rate of the generalized Fourier density estimator becomes . This indicates that the MISE rate of is parametric when the function is supersmooth. On the other hand, when the function is ordinary smooth, the optimal becomes and the MISE rate becomes . This is slower than the MISE rate when the function is supersmooth.
When : The results of part (b) of Theorem 3 demonstrate that the upper bounds for the MISE rate of the generalized Fourier density estimator are similar for both the supersmooth and ordinary smooth settings. The optimal radius and the MISE rate of the estimator is .
A.2 Generalized Fourier nonparametric regression estimator
In this appendix, we provide an additional result for the mean square error (MSE) rate of the generalized Fourier nonparametric regression estimator in equation (14) when , namely, the setting of the Fourier integral theorem. The results when for are left for future work.
When , the MSE rate of was established in Theorem 9 of Ho et al. when the function is a supersmooth function. Here, we restate that result for completeness.
Assume that the function is supersmooth function of order for some and . Furthermore, we assume that the function in the nonparametric regression model (3) is such that and
where is the Fourier transform of the function , and are some universal constants, and is some polynomial function of with non-negative coefficients. Then, we can find universal constants such that as long as we have
where denotes the degree of the polynomial function , and we define .
Appendix B Proofs
In this Appendix, we provide proofs for key results in the paper and in Appendix A.
B.1 Proof of Theorem 1
Recall that, are i.i.d. samples from the density function . In equation (12), the generalized Fourier density estimator of is given by:
where , , and . Direct calculation demonstrates that
An application of Taylor expansion up to the -th order indicates that
where , , and is Taylor remainder admitting the following form:
According to the hypothesis that for all , we obtain that
for any such that . Collecting the above results, we arrive at