The Transformer architecture has become dominant among state-of-the-art machine-learning (ML) models across nearly every benchmark, on tasks ranging from natural language understanding and modeling [NIPS2017_3f5ee243, devlin2018bert, dai2019transformer, radford2018improving, radford2019language, brown2020language, yang2019xlnet] to video scene understanding [wang2018non]. Despite these successes, most of the best-performing models still rely on deep and heavily compounded stacks of computationally expensive self-attention modules, each of which computes its own quadratic structural equation model (SEM) with its own graph adjacency matrix. The success of the Transformer has proven that compounding these SEMs yields a uniquely effective function approximator for even the most complex correlation functions, such as those that determine the structure of natural languages. However, there is also a growing body of evidence [fan2020addressing, bapna2020controlling, choromanski2020masked, choromanski2020rethinking, DBLP:journals/corr/abs-2002-07106, DBLP:journals/corr/abs-2006-03555, DBLP:journals/corr/abs-2009-14794] that many of these computations are superfluous, and that many state-of-the-art results can be reproduced with significantly fewer learnable parameters, making the computations more efficient and generally leading to faster training and better-performing models.
Optimizing the Transformer is an active field of research, and many of the most effective methods involve complicated rearrangements of traditional architectures. In a recent work [lee2021fnet], the authors presented a uniquely simplified variation on the standard autoencoding Transformer architecture, in which they substitute several self-attention sublayers with a computationally trivial procedure for mixing tokens using Fourier-transform coefficients, thus benefiting from the machinery of FFT algorithms such as Cooley-Tukey. Among other things, they demonstrate that these models can retain up to 92% accuracy on the GLUE benchmark without any self-attention sublayers, and up to 97% accuracy with only 2 self-attention sublayers out of 12, resulting in a 7-fold increase in training speed on GPU. These results imply that the careful computation of a graph adjacency matrix at each layer of the Transformer may be largely redundant, and that a much simpler structure can likely be used to accurately model high-level semantic meaning in natural languages. It is thus natural to ask whether these results generalize to pre-training tasks such as autoregressive language modeling (i.e. next-word prediction).
In this note, we explore the question of how many self-attention sublayers are sufficient for accurately modeling the causal structure of natural language. To this end we develop an autoregressive generalization of the FNet algorithm, called FNetAR, and apply it to the task of causal language modeling. Our experiments produce results analogous to those of the FNet analysis, with models retaining a competitive perplexity score on the Wikitext-103 benchmark relative to the baseline despite having up to half of their self-attention sublayers removed. In Section II we start with a brief pedagogical review of Transformer blocks and FNet blocks, as well as a description of the FNetAR generalization. In Section III we describe our experiments with the Wikitext-103 benchmark and report performance in comparison to a baseline Transformer-XL model. In Section IV we conclude with a discussion of these results as well as future directions.
II.1 Transformer Blocks
The Transformer block is defined by a 2-layer ResNet architecture with one self-attention layer followed by one position-wise feedforward layer. The ResNet architecture means that the transformation of a given datum through each layer of the block is restricted to an additive contribution from the output of a neural network, up to normalization, as shown in Equation 1.
This additive transformation is referred to as a residual connection, and ensures that:
the transformation through the model starts from an identity operation
the transformation occurs gradually as the data passes through each layer
neural-network blocks with different architectures can be stacked (like LEGOs)
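The residual transformation of Equation 1 can be sketched in a few lines of numpy (a minimal illustration; the normalization is omitted here):

```python
import numpy as np

def residual_layer(x, f):
    # Residual connection: the layer output is the input plus the
    # sublayer's contribution, so an untrained (near-zero) sublayer
    # leaves the representation close to an identity map.
    return x + f(x)

# If the sublayer outputs zeros, the block is exactly the identity,
# which is the "starts from an identity operation" property above.
x = np.random.randn(8, 16)  # (sequence length, hidden dimension)
assert np.allclose(residual_layer(x, lambda v: np.zeros_like(v)), x)
```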
The self-attention layer of the Transformer block is responsible for learning the structural relationships between different elements of a sequence, in this case represented by linguistic tokens. These relationships are determined by a graph adjacency matrix that is generated by the model and optimized to learn the relative importance of pairwise relationships between sequence elements for a given task. In the standard Transformer block, the graph matrix is generated by two neural networks, as in Equation 2.
When this graph matrix is contracted with some function of the input sequence and implemented with a residual connection, the result is a second-order structural equation model (SEM), as shown in Equation 3, in which two additional neural networks are applied to the input data before and after contraction with the graph matrix, respectively.
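Equations 2 and 3 can be sketched together in the standard scaled-dot-product form; the weight names (Wq, Wk, Wv, Wo) are illustrative placeholders, since the note's own symbols are elided in this text:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_block(x, Wq, Wk, Wv, Wo):
    # Graph adjacency matrix (Equation 2): generated from the input by
    # two learned maps (queries and keys), normalized row-wise so each
    # row is a distribution over the sequence.
    A = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1]))
    # Second-order SEM with residual connection (Equation 3): one
    # network applied before the contraction with A, one after.
    return x + (A @ (x @ Wv)) @ Wo

L, d = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
y = self_attention_block(x, Wq, Wk, Wv, Wo)
assert y.shape == (L, d)
```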
Finally, the position-wise feedforward layer consists of a single dense neural network applied to each sequence element in parallel, implemented with a residual connection as shown in Equation 4.
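A minimal sketch of Equation 4, assuming the common two-layer ReLU form of the feedforward network (the exact activation in the note is elided):

```python
import numpy as np

def feedforward_block(x, W1, b1, W2, b2):
    # Position-wise feedforward with residual connection (Equation 4):
    # the same dense network is applied to every sequence position
    # independently, so no token mixing happens in this layer.
    return x + (np.maximum(x @ W1 + b1, 0.0) @ W2 + b2)
```

Because the network is applied position-wise, permuting the sequence before or after the block gives the same result, which is why all token mixing must come from the attention (or Fourier) layer.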
II.2 Autoregressive Transformers
The Transformer block can trivially be made autoregressive by applying a causal mask to the graph matrix that reduces it to a lower-triangular form. This has the effect of restricting its contraction with the input data to eliminate causality-violating relationships, as in the modified SEM shown in Equation 5, enabling its use in time-series prediction tasks such as causal language modeling (next-word prediction).
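The causal masking of Equation 5 can be sketched as follows; masking with negative infinity before the softmax is one standard way to realize the lower-triangular form:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_matrix(scores):
    # Mask strictly-future entries with -inf before normalizing, which
    # makes the resulting graph matrix lower-triangular: row t only
    # mixes positions <= t, eliminating causality-violating links.
    L = scores.shape[0]
    masked = np.where(np.tril(np.ones((L, L), dtype=bool)), scores, -np.inf)
    return softmax(masked)
```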
Heavily compounding these quadratic SEMs into deep neural networks results in the powerful universal function approximator known as the Transformer. The generated graph matrices, being the only source of token mixing in the Transformer, act as a bottleneck for the flow of information between sequence elements. Additionally, since these graph matrices are input-dependent, the self-attention layers will generically learn different relationships for sufficiently different sets of input data, thus imbuing the model with context-dependence.
II.3 FNet Blocks
The FNet block is defined by a 2-layer ResNet architecture with one Fourier-mixing layer followed by a position-wise feedforward layer. The Fourier-mixing layer is analogous to the self-attention layer of the Transformer block, except that the learnable, context-dependent graph matrix is eliminated in favor of a linear rules-based operation that mixes a sequence element by sampling its complementary elements with discrete waveform coefficients whose frequencies decay as a function of distance. (In the original implementation of FNet the authors apply a full 2D FFT over both the sequence and hidden dimensions; here we describe a simpler variation with an FFT applied to the sequence dimension only.) This sampling strategy is equivalent to fixing the graph adjacency matrix of the standard Transformer to be the discrete Fourier transform matrix shown in Equation 6.
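A minimal sketch of this Fourier-mixing layer for the sequence-dimension-only variant described here, keeping the real part (as in FNet) so the representation stays real-valued:

```python
import numpy as np

def fourier_mixing_block(x):
    # Token mixing via a DFT along the sequence dimension only, with a
    # residual connection. There are no learnable parameters: the
    # "graph matrix" is fixed to the DFT matrix of Equation 6.
    return x + np.fft.fft(x, axis=0).real

# The FFT is equivalent to contraction with a fixed graph matrix,
# namely the DFT matrix F[j, k] = exp(-2*pi*1j*j*k / L).
L, d = 8, 4
x = np.random.randn(L, d)
F = np.fft.fft(np.eye(L))
assert np.allclose(fourier_mixing_block(x), x + (F @ x).real)
```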
Although this process resembles an FFT, it should really be thought of as a rules-based strategy for sparsely sampling different linear combinations of the sequence input prior to running them through the position-wise feedforward layer. The most salient reason this sampling strategy should NOT be thought of as a "genuine" FFT is the fact that the residual connection additively mixes Fourier and non-Fourier modes. (The authors of this paper are not aware of any sensible mathematical interpretation for this procedure.)
II.4 FNet Autoregressive
Since the FNet sampling strategy is not technically a Fourier transform, there likely does not exist a correspondingly unique procedure for making it autoregressive in the sense of Equation 5. Moreover, the requirement that every element in a mini-batch samples from a context window of the same size would demand an attention graph that is simultaneously upper- and lower-triangular, in addition to being non-trivial. For this reason FNetAR is more amenable to combination with recurrent Transformer architectures such as Transformer-XL, which compose their attention graphs by concatenating their hidden states with additional "memory states" along the sequence dimension, resulting in a non-square, rectangular attention graph matrix. Within these frameworks, a causally faithful version of the FNet graph matrix in Equation 6 can be obtained using a simple procedure: pad a Fourier transform matrix with zeros and then perform a roll over its rows, as shown in Equation 7.
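One way to realize this pad-and-roll construction in code; since the exact shapes and normalization of Equation 7 are elided in this text, the details below (DFT normalization, roll offsets) are illustrative guesses rather than the note's exact procedure:

```python
import numpy as np

def causal_fourier_matrix(n):
    # Sketch of the pad-and-roll idea: start from a DFT matrix, pad it
    # with zeros, and roll each row so that row t has support only on
    # positions 0..t. Position t then samples a single frequency mode
    # of its preceding elements, and the matrix is lower-triangular.
    F = np.fft.fft(np.eye(n)) / n                    # normalized DFT matrix
    padded = np.hstack([F, np.zeros((n, n))])        # pad with zeros
    G = np.stack([np.roll(padded[t], t + 1 - n) for t in range(n)])
    return G[:, :n]                                  # crop back to n columns

# The result is causal: row t is zero beyond column t.
G = causal_fourier_matrix(8)
assert np.allclose(np.triu(G, 1), 0)
```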
In this construction, each sequence element samples a specific frequency mode of its preceding elements, with magnitudes that are largest for late parts of the sequence and decay toward a mean-sampling at the beginning of the sequence. This kind of autoregressive Fourier transform has been developed and applied to problems in computer vision [DBLP:journals/corr/abs-2104-02555], but to our knowledge this is its first application to the task of causal language modeling.
III.1 Wikitext-103 Benchmark
We tested the performance of FNetAR against a Transformer-XL baseline on the task of next-word prediction using the Wikitext-103 benchmark dataset. Our preliminary baseline is the medium-sized Transformer-XL model. We find that despite replacing half of the self-attention layers with the linear operation in Equation 7, FNetAR retains surprisingly strong performance, achieving a perplexity score of 25.81 relative to 24.23 for the Transformer-XL baseline.
| Model                 | Perplexity | Non-embedding params | Total params |
|-----------------------|------------|----------------------|--------------|
| Transformer-XL Medium | 24.23      | 41.1 M               | 151.1 M      |
| FNetAR Medium         | 25.81      | 34.3 M               | 144.4 M      |
| FNetAR Large          | XX.X       | 198.3 M              | 237.9 M      |
| Transformer-XL Large  | 18.31      | 245.5 M              | 285.2 M      |
The unreasonable effectiveness of this FFT-inspired sampling procedure as a replacement for self-attention stems from the fact that every linear combination of sequence embeddings generated by contraction with the graph matrix is funneled through the same position-wise feedforward network. An effective flow of information through the network thus requires that some structure be encoded into the embeddings which allows the feedforward network to disambiguate different elements of the sequence; this is identical to the process used to generate positional embeddings. FFT coefficients naturally provide a powerful schema for sampling linear combinations of vectorized representations in a way that maximizes the distinguishability between different components, which is precisely what Fourier transforms are designed to do.
WORK IN PROGRESS: For now the FNetAR algorithm exists as (1) further evidence that numerous compounded computations of a structure graph are superfluous for many tasks in natural language understanding, and (2) a systematic and simple method for parameter reduction, applicable to any recurrent Transformer model. Although FNetAR should also be faster than its Transformer-XL counterpart, we are currently optimizing the autoregressive Fourier transform and will not be able to comment on gains in training speed until v1 of this note is released. That update will also include a comparison of the large models, as well as combinations with the Feedback Transformer, which is likely highly optimized relative to Transformer-XL. There is also reason to believe that FNet may improve the interpretability of the attention-score graphs: since FNet squeezes the structure-learning ability of standard Transformers into fewer layers, the learned relationships will be fewer in number, and thus each is likely to be more meaningful. A cursory exploration of this should also be expected.
Despite the fact that this sampling strategy does not produce an overall transformation that mathematically resembles a Fourier transform, we find these experiments useful for thinking about how to optimize the extraction of information using Fourier duality. All evidence indicates that intellectually useful information exists at multiple scales and is encoded in both local and non-local correlations. We thus find it plausible that Fourier transforms may be a salient component of systems that efficiently extract both local and non-local information, and that autoregressive generalizations of them would be necessary for adaptation to tasks such as time-series prediction and causal inference. Indeed, we find it highly likely both that (1) existing architectures such as convolutional networks are already leveraging the equivalence between kernel convolutions and Fourier transforms, and (2) there exist additional time-invariant or convolutional causal forms that could be used to construct further-optimized sampling strategies.