I Introduction
The Transformer architecture has become dominant among state-of-the-art machine-learning (ML) models across nearly every benchmark, on tasks ranging from natural-language understanding and modeling [NIPS2017_3f5ee243, devlin2018bert, dai2019transformer, radford2018improving, radford2019language, brown2020language, yang2019xlnet] to video scene understanding [wang2018non]. Despite these successes, most of the best-performing models still rely on deep and heavily compounded layers of computationally expensive self-attention modules, each of which computes its own quadratic structural equation model (SEM) with its own graph adjacency matrix. The success of the Transformer has proven that compounding these SEMs results in a uniquely effective function approximator for even the most complex correlation functions, such as those that determine the structure of natural languages. However, there is also a growing body of evidence [fan2020addressing, bapna2020controlling, choromanski2020masked, choromanski2020rethinking, DBLP:journals/corr/abs200207106, DBLP:journals/corr/abs200603555, DBLP:journals/corr/abs200914794] that many of these computations are superfluous and that many state-of-the-art results can be reproduced with significantly fewer learnable parameters, making computations more efficient and generally leading to faster training and better-performing models. Optimizing the Transformer is currently an active field of research, and many of the most effective methods involve complicated rearrangements of traditional architectures. In a recent work [lee2021fnet], the authors presented a uniquely simplified variation on the standard autoencoding Transformer architecture, in which they substitute the self-attention sublayers with a computationally trivial procedure for mixing tokens using Fourier-transform coefficients, thus benefiting from the machinery of FFT algorithms such as Cooley-Tukey. Among other things, they demonstrate that these models can retain up to 92% of the baseline accuracy on the GLUE benchmark without any self-attention sublayers, and up to 97% with only 2 self-attention sublayers out of 12, resulting in a 7-fold increase in training speed on GPU. These results imply that the numerous careful computations of a graph adjacency matrix at each layer of the Transformer may be largely redundant, and that a much simpler structure can likely be used for accurately modeling high-level semantic meaning in natural languages. It is thus natural to ask whether these results generalize to pretraining tasks such as autoregressive language modeling (i.e., next-word prediction).
In this note, we explore the question of how many self-attention sublayers are sufficient for accurately modeling the causal structure of natural language. To this end we develop an autoregressive generalization of the FNet algorithm, called FNetAR, and apply it to the task of causal language modeling. Our experiments produce results analogous to those from the FNet analysis, with models retaining a competitive perplexity score on the WikiText-103 benchmark despite having up to half of their self-attention sublayers removed. In Section II we begin with a brief pedagogical review of Transformer blocks and FNet blocks, as well as a description of the FNetAR generalization. In Section III we describe our experiments on the WikiText-103 benchmark and report performance in comparison to a baseline Transformer-XL model. In Section IV we conclude with a discussion of these results as well as future directions.
II Modeling
II.1 Transformer Blocks
The Transformer block is defined by a 2-layer ResNet architecture with one self-attention layer followed by one position-wise feed-forward layer. The ResNet architecture implies that the transformation of a given datum through each layer of the block is restricted to an additive contribution from the output of some neural network $f$, scaled by some normalization parameter $\alpha$, as shown in Equation 1.

(1)  $x_i \;\rightarrow\; x_i + \alpha\, f(x_i)$
This additive transformation is referred to as a residual connection, and ensures that:

- the transformation through the model starts from an identity operation,

- the transformation occurs gradually as the data passes through each layer, and

- neural-network blocks with different architectures can be stacked (like LEGOs).
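The residual update of Equation 1 can be sketched in a few lines of NumPy. The two-layer MLP standing in for the sublayer, and the names `f`, `residual_step`, and `alpha`, are illustrative choices for this note, not part of any reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy hidden dimension
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def f(x):
    """A toy sublayer: two dense layers with a ReLU in between."""
    return np.maximum(x @ W1, 0.0) @ W2

def residual_step(x, alpha):
    """Equation 1: x -> x + alpha * f(x)."""
    return x + alpha * f(x)

x = rng.normal(size=(4, d))              # a sequence of 4 token embeddings
# With alpha = 0 the layer is exactly the identity operation:
assert np.allclose(residual_step(x, alpha=0.0), x)
```

Setting `alpha = 0` recovers the identity, illustrating the first bullet point above: training can start from a transformation that does nothing and deform it gradually.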
The self-attention layer of the Transformer block is responsible for learning the structural relationships between different elements of a sequence $\{x_i\}$, in this case represented by linguistic tokens. These relationships are determined by a graph adjacency matrix $A$ that is generated by the model and optimized to learn the relative importance of the relationship between sequence elements $x_i$ and $x_j$ towards a given task. In the standard Transformer block, the graph matrix is generated by two neural networks $q$ and $k$ as in Equation 2.
(2)  $A_{ij} \;=\; \mathrm{softmax}_j\!\left(\frac{q(x_i)\cdot k(x_j)}{\sqrt{d}}\right)$
When this graph matrix is contracted with some function of the input sequence and implemented with a residual connection, the result is a second-order structural equation model (SEM) as shown in Equation 3. Here $v$ and $g$ are two additional neural networks applied to the input data before and after contraction with the graph matrix, respectively.
(3)  $x_i \;\rightarrow\; x_i + g\!\left(\sum_{j} A_{ij}\, v(x_j)\right)$
Finally, the position-wise feed-forward layer consists of a single dense neural network $\mathrm{FF}$ applied to each sequence element in parallel, implemented with a residual connection as shown in Equation 4.
(4)  $x_i \;\rightarrow\; x_i + \mathrm{FF}(x_i)$
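Equations 2-4 can be assembled into a minimal NumPy sketch of one Transformer block: a single head, with plain linear maps standing in for the neural networks $q$, $k$, $v$, $g$, and $\mathrm{FF}$, and with layer normalization omitted. All weight names and shapes here are toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 8                               # toy sequence length, hidden dim
Wq, Wk, Wv, Wg, Wff = [0.1 * rng.normal(size=(d, d)) for _ in range(5)]

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def graph_matrix(x):
    # Equation 2: A_ij from the contraction of q(x_i) with k(x_j)
    return softmax_rows((x @ Wq) @ (x @ Wk).T / np.sqrt(d))

def transformer_block(x):
    # Equation 3: second-order SEM, implemented with a residual connection
    x = x + (graph_matrix(x) @ (x @ Wv)) @ Wg
    # Equation 4: position-wise feed-forward, also with a residual connection
    return x + np.maximum(x @ Wff, 0.0) @ Wff.T

x = rng.normal(size=(N, d))
y = transformer_block(x)                  # same shape as the input, (N, d)
```

Note that each row of the graph matrix is a probability distribution over sequence elements, so the contraction in Equation 3 mixes every token with a learned weighting of all the others.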
II.2 Autoregressive Transformers
The Transformer block can trivially be made autoregressive by applying a causal mask to the graph matrix that reduces it to a lower-triangular form. This has the effect of restricting its contraction with the input data to eliminate causality-violating relationships, as in the modified SEM shown in Equation 5, enabling its use in time-series prediction tasks such as causal language modeling (next-word prediction).
(5)  $x_i \;\rightarrow\; x_i + g\!\left(\sum_{j \le i} A_{ij}\, v(x_j)\right)$
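A minimal sketch of the causal mask in Equation 5: the attention logits above the diagonal are set to $-\infty$ before the softmax, so the resulting graph matrix is lower-triangular and element $i$ only mixes with elements $j \le i$. As before, plain linear maps stand in for the networks $q$ and $k$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5, 8                               # toy sequence length, hidden dim
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def causal_graph_matrix(x):
    logits = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    # Mask out j > i: exp(-inf) = 0, so future elements get zero weight
    logits = np.where(np.tril(np.ones((N, N), dtype=bool)), logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

A = causal_graph_matrix(rng.normal(size=(N, d)))
# Lower-triangular: no information flows backward from the future
assert np.allclose(A, np.tril(A))
```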
Heavily compounding these quadratic SEMs into deep neural networks results in the powerful universal function approximator known as the Transformer. The generated graph matrices $A$, being the only source of token mixing in the Transformer, act as a bottleneck for the flow of information between sequence elements $x_i$ and $x_j$. Additionally, since these graph matrices are input-dependent, the self-attention layers will generically learn different relationships for sufficiently different sets of input data, thus imbuing the model with context-dependence.
II.3 FNet Blocks
The FNet block is defined by a 2-layer ResNet architecture with one Fourier-mixing layer followed by a position-wise feed-forward layer. The Fourier-mixing layer is analogous to the self-attention layer of the Transformer block, except that the learnable context-dependent graph matrix is eliminated in favor of a linear rules-based operation that mixes a sequence element by sampling its complementary elements with discrete waveform coefficients whose frequencies decay as a function of distance. (In the original implementation of FNet the authors apply a full 2D FFT over both the sequence and hidden dimensions; here we describe a simpler variation with an FFT applied to the sequence dimension only.) This sampling strategy is equivalent to fixing the graph adjacency matrix of the standard Transformer to be the matrix shown in Equation 6, where $N$ is the sequence length.
(6)  $A_{jk} \;=\; e^{-2\pi i\, jk/N}$
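The fixed mixing matrix in Equation 6 is simply the DFT matrix, so contracting it with the sequence computes exactly what an FFT computes in $O(N \log N)$ time. A quick NumPy check of that equivalence, mixing over the sequence dimension only as described above:

```python
import numpy as np

N = 16
jj, kk = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
# Equation 6: A_jk = exp(-2*pi*i * j*k / N), the DFT matrix
A = np.exp(-2j * np.pi * jj * kk / N)

rng = np.random.default_rng(3)
x = rng.normal(size=(N, 4))              # toy sequence of 4-dim embeddings
# Dense contraction with A reproduces the FFT along the sequence axis
assert np.allclose(A @ x, np.fft.fft(x, axis=0))
```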
Although this process resembles an FFT, it should really be thought of as a rules-based strategy for sparsely sampling different linear combinations of the sequence input prior to running them through the position-wise feed-forward layer. The most salient reason that this sampling strategy should NOT be thought of as a "genuine" FFT is the fact that the residual connection additively mixes Fourier and non-Fourier modes. (The authors of this paper are not aware of any sensible mathematical interpretation for this procedure.)
II.4 FNet Autoregressive
Since the FNet sampling strategy is not technically a Fourier transform, there likely does not exist a correspondingly unique procedure for making it autoregressive in the sense of Equation 5. However, the requirement that every element in a minibatch samples with a context window of the same size requires an attention graph that is simultaneously upper and lower triangular in addition to being nontrivial. For this reason FNetAR is more amenable to combination with recurrent Transformer architectures such as Transformer-XL, which compose their attention graphs by concatenating their hidden states with additional "memory states" along the sequence dimension, resulting in an attention graph matrix that is non-square (rectangular) of shape $N \times (N+M)$, where $N$ is the sequence length and $M$ is the number of memory states. Within these frameworks, a causally faithful version of the FNet graph matrix in Equation 6 can be obtained using a simple procedure that involves padding a Fourier-transform matrix with an $N \times M$ block of zeros and then performing a roll over the rows as shown in Equation 7, where $\mathcal{F}_{jk} = e^{-2\pi i\, jk/N}$.

(7)  $A \;=\; \mathrm{roll}_{\text{rows}}\!\left(\left[\,\mathcal{F}\;\middle|\;0_{N\times M}\,\right]\right)$
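One possible reading of this pad-and-roll construction, sketched in NumPy. We pad the $N \times N$ DFT matrix with an $N \times M$ block of zeros and cyclically shift row $j$ so that its nonzero band ends at column $M + j$, the causal position of element $j$ in the concatenated [memory | hidden] sequence. The specific shift schedule `M + row + 1 - N` is our assumption; the prose specifies only "a roll over the rows", and the function name `fnet_ar_matrix` is hypothetical.

```python
import numpy as np

def fnet_ar_matrix(N, M):
    """Pad a normalized DFT matrix to N x (N + M) and roll each row so
    its nonzero entries cover a fixed-size causal window (assumed shift
    schedule, for illustration; requires M >= N - 1 to avoid wraparound)."""
    jj, kk = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    F = np.exp(-2j * np.pi * jj * kk / N) / N            # normalized DFT
    A = np.concatenate([F, np.zeros((N, M))], axis=1)    # pad to N x (N+M)
    for row in range(N):
        A[row] = np.roll(A[row], M + row + 1 - N)        # align band causally
    return A

N, M = 8, 8
A = fnet_ar_matrix(N, M)
# Row j is nonzero only on columns (M + j + 1 - N) .. (M + j): every
# sequence element mixes a same-size causal window of its predecessors.
```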
In this construction, each sequence element samples a specific frequency mode of its preceding elements, whose frequency is largest for late parts of the sequence and decays down to a simple mean-sampling at the beginning of the sequence. This kind of autoregressive Fourier transform has been developed and applied to problems in computer vision [DBLP:journals/corr/abs210402555], but to our knowledge this is its first application to the task of causal language modeling.

III Experiments
III.1 WikiText-103 Benchmark
We tested the performance of FNetAR against the Transformer-XL baseline on the task of next-word prediction using the WikiText-103 benchmark dataset. Our preliminary baseline is the medium-sized Transformer-XL model. We find that despite replacing half of the self-attention layers with the linear operation in Equation 7, FNetAR retains surprisingly strong performance, achieving a perplexity score of 25.81 relative to 24.23 for the Transformer-XL baseline.
Model                     Perplexity (ppl)   Parameters (Transformer)   Parameters (All)
Transformer-XL Medium     24.23              41.1 M                     151.1 M
FNetAR Medium             25.81              34.3 M                     144.4 M
FNetAR Large              XX.X               198.3 M                    237.9 M
Transformer-XL Large      18.31              245.5 M                    285.2 M
IV Discussion
The unreasonable effectiveness of this FFT-inspired sampling procedure as a replacement for self-attention stems from the fact that every linear combination of sequence embeddings generated by contraction with the graph matrix is funneled through the same position-wise feed-forward network. An effective flow of information through the network thus requires that some structure be encoded into the embeddings which allows the feed-forward network to disambiguate different elements of the sequence. This is identical to the process used to generate positional embeddings. FFT coefficients naturally provide a powerful schema for sampling linear combinations of vectorized representations in a way that maximizes the distinguishability between different components, a consequence of the orthogonality of the Fourier basis.
WORK IN PROGRESS: For now the FNetAR algorithm exists as (1) further evidence that numerous compounded computations of a structure graph are superfluous for many tasks in natural-language understanding, and (2) a systematic and simple method for parameter reduction, applicable to any recurrent Transformer model. Although FNetAR should also be faster than its Transformer-XL counterpart, we are currently working on optimizing the autoregressive Fourier transform and will not be able to comment on the gains in training speed until v1 of this note is released. This updated v1 will also include a comparison of the large models, as well as combinations with the Feedback Transformer, which is likely highly optimized relative to Transformer-XL. There is also reason to believe that FNet may improve the interpretability of the attention-score graphs: since FNet squeezes the structure-learning ability of standard Transformers into fewer layers, the relationships learned will be fewer and thus each will likely be more meaningful. A cursory exploration of this should also be expected.
Despite the fact that this sampling strategy does not produce an overall transformation that mathematically resembles a Fourier transform, we find these experiments to be useful for thinking about how to optimize the extraction of information using Fourier duality. All evidence indicates that intellectually useful information exists at multiple scales and is encoded in both local and nonlocal correlations. We thus find it plausible that Fourier transforms may be a salient component of systems that efficiently extract both local and nonlocal information, and consequently that their autoregressive generalizations would be necessary for adaptation to tasks such as time-series prediction and causal inference. Indeed we find it highly likely both that (1) existing architectures such as convolutional networks are already leveraging the equivalence between kernel convolutions and Fourier transforms, and (2) there exist additional time-invariant or convolutional causal forms that could be used to construct further optimized sampling strategies.
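The kernel-convolution / Fourier-transform equivalence invoked in point (1) is just the convolution theorem, which can be verified numerically in a few lines: circular convolution in the signal domain equals pointwise multiplication in the Fourier domain.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 32
x = rng.normal(size=N)                   # a toy signal
h = rng.normal(size=N)                   # a toy convolution kernel

# Direct circular convolution: (x * h)[n] = sum_k x[k] h[(n - k) mod N]
direct = np.array([sum(x[k] * h[(n - k) % N] for k in range(N))
                   for n in range(N)])
# Convolution theorem: multiply in the Fourier domain, transform back
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))
assert np.allclose(direct, via_fft)
```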