1 Introduction
Transformers are deep feedforward artificial neural networks with a (self-)attention mechanism. They have been tremendously successful in natural language processing tasks and other domains. Since their inception 5 years ago
[VSP17], many variants have been suggested [LWLQ21]. Descriptions are usually graphical, verbal, partial, or incremental. Despite their popularity, it seems no pseudocode has ever been published for any variant. Contrast this to other fields of computer science, even to the “cousin” discipline of reinforcement learning
[MKS13, SBB18, EMK21]. This report intends to rectify the situation for Transformers. It aims to be a self-contained, complete, precise and compact overview of transformer architectures and formal algorithms (but not results). It covers what Transformers are (Section 6), how they are trained (Section 7), what they are used for (Section 3), their key architectural components (Section 5), tokenization (Section 4), and a preview of practical considerations (Section 8) and the most prominent models.
The essentially complete pseudocode is about 50 lines, compared to thousands of lines of actual source code. We believe these formal algorithms will be useful for theoreticians who require compact, complete, and precise formulations and for experimental researchers interested in implementing a Transformer from scratch, and that they will encourage authors to augment their papers or textbooks with formal Transformer algorithms (Section 2).
The reader is assumed to be familiar with basic ML terminology and simpler neural network architectures such as MLPs.
In short, a (formally inclined) reader, upon understanding the contents of this document, will have a solid grasp of transformers: they will be ready to read and contribute to the literature on the topic as well as implement their own Transformer using the pseudocode as templates.
2 Motivation
The true story above the introduction describes quite well the feeling we have when browsing many Deep Learning (DL) papers: unable to figure out the algorithmic suggestions exactly. For practitioners, the papers may be sufficiently detailed, but the precision needed by theoreticians is usually higher. For reasons not entirely clear to us, the DL community seems shy of providing pseudocode for their neural network models. Below we describe the state of the art in DL paper writing and argue for the value of formal algorithms. The reader already convinced of their merit can skip this section without loss.
The lack of scientific precision and detail in DL publications.
Deep Learning has been tremendously successful in the last 5 to 10 years, with thousands of papers published every year. Many describe only informally how they change a previous model. Some 100+ page papers contain only a few lines of prose informally describing the model [RBC21]. At best there are some high-level diagrams. No pseudocode. No equations. No reference to a precise explanation of the model. One may argue that most DL models are minor variations of a few core architectures, such as the Transformer [VSP17], so a reference augmented by a description of the changes should suffice. This would be true if (a) the changes were described precisely, (b) the reference architecture had been described precisely elsewhere, and (c) a reference were given to this description. Some if not all three are lacking in most DL papers. To the best of our knowledge no one has even provided pseudocode for the famous Transformer and its encoder/decoder-only variations.
Interfacing algorithms.
Equally important are proper explanations of how these networks are trained and used, but sometimes it is not even clear what the inputs and outputs and potential side effects of the described model are. Of course someone experienced would usually be able to correctly guess, but this is not a particularly scientific approach. The experimental section in publications often does not explain what is actually fed into the algorithms and how. If there is some explanation in the methods section, it is often disconnected from what is described in the experimental section, possibly due to different authors writing the different sections. The core algorithms for the models should be accompanied by the wrapper algorithms that call them, e.g. (pre)training, fine-tuning, prompting, inference, deployment. Sometimes this is simple, but sometimes the magic happens there. In any case, if these things are not formally stated, they remain unclear. Again, if the setup is standard and has been formally explained elsewhere, a simple reference will do.
Source code vs pseudocode.
Providing open source code is very useful, but not a proper substitute for formal algorithms. There is a massive difference between a (partial) Python dump and well-crafted pseudocode. A lot of abstraction and cleanup is necessary: remove boilerplate code, use mostly single-letter variable names, replace code by math expressions wherever possible, e.g. replace loops by sums, remove (some) optimizations, etc. Well-crafted pseudocode is often less than a page and still essentially complete, compared to often thousands of lines of real source code. This is hard work that no one seems to be willing to do. Of course a forward process of first designing algorithms and writing up pseudocode on paper, and then implementing them, is fine too, but few DL practitioners seem to work that way.
Examples of good neural network pseudocode and mathematics and explanations.
Multi-Layer Perceptrons (MLPs) are usually well-described in many papers, e.g.
[MPCB14, BFT17, JGH18], though also without pseudocode. For a rare textbook example of pseudocode for a non-trivial neural network architecture, see Algorithm S2 of [SGBK21], which constitutes a complete, i.e. essentially executable, pseudocode of just 25 lines, based on a 350-line Python Colab toy implementation, which itself is based on a proper 1000+ line implementation.
This work aims to do the same for Transformers: the whole decoder-only Transformer GPT LABEL:algo:DTransformer, based on attention Algorithms 4 and 5 and normalization Algorithm 6, including training Algorithm 10 and prompting and inference Algorithm 11, is altogether less than 50 lines of pseudocode, compared e.g. to the 2000-line self-contained C implementation [Bel21].
[Ala19] is a great blog post explaining Transformers, and [EGKZ21] describes the attention mechanism with sufficient mathematical precision to allow proving properties about it, but neither provides pseudocode. [Elh21] is an attempt to understand Transformers by reverse-engineering the computations they perform and interpreting them as circuits.
Motivation.
But does anyone actually need pseudocode, and what would it be useful for (we sometimes get asked)? We find the absence of pseudocode in DL and this question quite perplexing, but apparently it requires answering. Providing such pseudocode can be useful for many purposes:

They can be used as templates and adapted to precisely describe future variations, and therewith set a new standard in DL publishing. We explicitly encourage the reader to copy and adapt them to their needs and cite the original as “adapted from [PH22]”.

Having all that matters on one page in front of you makes it easier to develop new variations compared to reading prose or scrolling through 1000s of lines of actual code.

They can be used as a basis for new implementations from scratch, e.g. in different programming languages, without having to wade through and reverse-engineer existing idiosyncratic real source code.

They may establish a notational convention, which eases communication and the reading of future variations.

The process of converting source code into pseudocode can expose implementation errors (as it e.g. did in [SGBK21]).

Theoreticians need compact, complete, and precise representations for reasoning and ultimately proving properties about algorithms. They are often unwilling or unable to reverse engineer code, or guess the meaning of words or fancy diagrams.
With this motivation in mind, the following five sections formally describe all aspects of transformer architectures, training, and inference.
3 Transformers and Typical Tasks
Transformers are neural network models that excel at natural language processing, or more generally at modelling sequential data. Two common types of tasks they are used for are sequence modelling and sequence-to-sequence prediction.
Notation.
Let V denote a finite set, called a vocabulary, often identified with [N_V] := {1, ..., N_V}. This could be words or letters, but typically it consists of subwords, called tokens. Let x ≡ x[1:ℓ] ≡ x[1] x[2] ... x[ℓ] ∈ V* be a sequence of tokens, e.g. a sentence or a paragraph or a document. Unlike in Python, we use arrays starting from 1, and x[1:ℓ] includes x[ℓ]. For a matrix M, we write M[i, :] for the i-th row and M[:, j] for the j-th column. We use the matrix × column-vector convention more common in mathematics, compared to the default row-vector × matrix convention in the transformer literature, i.e. our matrices are transposed. See Appendix B for a complete list of notation.
Chunking.
The predominant paradigm in machine learning is (still) learning from independent and identically distributed (i.i.d.) data. Even for sequence modelling, for practical reasons this tradition is upheld. The training data may naturally be a collection of (independent) articles, but even then, some may exceed the maximal context length ℓ_max transformers can handle. In this case, an article is crudely broken into shorter chunks of length ≤ ℓ_max.
Sequence modelling (DTransformer).
Given a vocabulary V, let x_n ∈ V* for n ∈ [N_data] be a dataset of sequences (imagined to be) sampled i.i.d. from some distribution P over V*. The goal is to learn an estimate P̂ of the distribution P(x). In practice, the distribution estimate is often decomposed via the chain rule as P̂_θ(x) = P̂_θ(x[1]) · P̂_θ(x[2] | x[1]) ··· P̂_θ(x[ℓ] | x[1:ℓ−1]), where θ consists of all neural network parameters to be learned. The goal then becomes learning a distribution over a single token x[t] given its preceding tokens x[1:t−1] as context.
Examples include language modelling, RL policy distillation, and music generation.
Sequence-to-sequence (seq2seq) prediction (EDTransformer).
Given a vocabulary V and an i.i.d. dataset of sequence pairs (z_n, x_n) ~ P, where P is a distribution over V* × V*, learn an estimate of the conditional distribution P(x | z). In practice, the conditional distribution estimate is often decomposed as P̂_θ(x | z) = P̂_θ(x[1] | z) · P̂_θ(x[2] | x[1], z) ··· P̂_θ(x[ℓ] | x[1:ℓ−1], z).
Examples include translation (z a sentence in English, x the same sentence in German), question answering (z a question, x the corresponding answer), and text-to-speech (z a piece of text, x a voice recording of someone reading the text).
Classification (ETransformer).
Given a vocabulary V and a set of classes [N_C], let (x_n, c_n) ∈ V* × [N_C] for n ∈ [N_data] be an i.i.d. dataset of sequence–class pairs sampled from P(x, c). The goal in classification is to learn an estimate of the conditional distribution P(c | x).
Examples include sentiment classification, spam filtering, and toxicity classification.
4 Tokenization: How Text is Represented
In the context of natural language tasks, tokenization refers to how a piece of text such as “My grandma makes the best apple pie.” is represented as a sequence of vocabulary elements (called tokens).
Character-level tokenization.
One possible choice is to let V be the English alphabet (plus punctuation). In the example above, we’d get a sequence of length 36: [‘M’, ‘y’, ‘ ’, …]. Character-level tokenization tends to yield very long sequences.
Word-level tokenization.
Another choice would be to let V consist of all English words (plus punctuation). In the example above, we’d get a sequence of length 7: [‘My ’, ‘grandma ’, ‘makes ’, …]. Word-level tokenization tends to require a very large vocabulary and cannot deal with new words at test time.
Subword tokenization.
This is the method used in practice nowadays: V is a set of commonly occurring word segments like ‘cious’, ‘ing’, ‘pre’. Common words like ‘is ’ are often a separate token, and single characters are also included in V to ensure all words are expressible.
Final vocabulary and text representation.
Given a choice of tokenization / vocabulary, each vocabulary element is assigned a unique index in {1, 2, ..., N_V − 3}. A number of special tokens are then added to the vocabulary. The number of special tokens varies, and here we will consider three: mask_token (:= N_V − 2), used in masked language modelling (see Algorithm 9); bos_token (:= N_V − 1), used for representing the beginning of a sequence; and eos_token (:= N_V), used for representing the end of a sequence. The complete vocabulary has N_V elements.
A piece of text is represented as a sequence of indices (called token IDs) corresponding to its (sub)words, preceded by bos_token and followed by eos_token.
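To make the indexing concrete, here is a minimal sketch; the vocabulary and the subword split are invented for illustration:

```python
# A hypothetical toy subword vocabulary; indices 1..N_V-3 are ordinary
# tokens, the last three indices are the special tokens described above.
vocab = {"my": 1, "grand": 2, "ma": 3, ".": 4}
N_V = len(vocab) + 3
mask_token, bos_token, eos_token = N_V - 2, N_V - 1, N_V  # here: 5, 6, 7

def encode(subwords):
    """Subword sequence -> token IDs, wrapped in bos_token / eos_token."""
    return [bos_token] + [vocab[s] for s in subwords] + [eos_token]

ids = encode(["my", "grand", "ma", "."])
```

In practice the subword split itself is produced by an algorithm such as byte-pair encoding, not by hand as here.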
5 Architectural Components
The following are the neural network building blocks (functions with learnable parameters) from which transformers are made. Full architectures featuring these building blocks are presented in the next section. (By a slight abuse of notation, we identify V with the set {1, 2, ..., N_V}.)
Token embedding.
The token embedding learns to represent each vocabulary element as a vector in ℝ^{d_e}; see Algorithm 1.
Positional embedding.
The positional embedding learns to represent a token’s position in a sequence as a vector in ℝ^{d_e}. For example, the position of the first token in a sequence is represented by a (learned) vector W_p[:, 1], the position of the second token is represented by another (learned) vector W_p[:, 2], etc. The purpose of the positional embedding is to allow a Transformer to make sense of word ordering; in its absence the representation would be permutation invariant and the model would perceive sequences as “bags of words” instead.
Learned positional embeddings require that the input sequence length is at most some fixed number ℓ_max (the size of the learned positional embedding matrix must be finite and fixed in advance of training). An intuitive explanation of how this works can be found at [Ala18]. For pseudocode, see Algorithm 2.
Not all transformers make use of learned positional embeddings; some use a hard-coded mapping instead [Ker21]. Such hard-coded positional embeddings can (theoretically) handle arbitrarily long sequences. The original Transformer [VSP17] uses
W_p[2i−1, t] = sin(t / ℓ_max^{2i/d_e}),   W_p[2i, t] = cos(t / ℓ_max^{2i/d_e}),
for 0 < i ≤ d_e / 2.
The positional embedding of a token is usually added to the token embedding to form a token’s initial embedding. For the t-th token of a sequence x, the embedding is
e = W_e[:, x[t]] + W_p[:, t].   (1)
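The hard-coded sinusoidal variant can be sketched in a few lines of Python (a minimal illustration of the formula above, not the learned-embedding Algorithm 2; the dimension d_e = 8 is an arbitrary choice):

```python
import math

d_e, l_max = 8, 10_000  # embedding dimension; the l_max constant of the formula

def positional_embedding(t):
    """Hard-coded sinusoidal position embedding W_p[:, t]: rows 2i-1 and 2i
    (1-based) hold sin and cos at frequency 1 / l_max^(2i/d_e),
    for 0 < i <= d_e/2."""
    e = [0.0] * d_e
    for i in range(1, d_e // 2 + 1):
        freq = 1.0 / l_max ** (2 * i / d_e)
        e[2 * i - 2] = math.sin(t * freq)  # 1-based row 2i-1
        e[2 * i - 1] = math.cos(t * freq)  # 1-based row 2i
    return e
```

Each (sin, cos) pair lies on the unit circle, so the embedding stays bounded for arbitrarily long sequences; the token's initial embedding would then add W_e[:, x[t]] as in eq. (1).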
Attention.
Attention is the main architectural component of transformers. It enables a neural network to make use of contextual information (e.g. preceding text or the surrounding text) for predicting the current token.
On a high level, attention works as follows: the token currently being predicted is mapped to a query vector q ∈ ℝ^{d_attn}, and the tokens in the context are mapped to key vectors k_t ∈ ℝ^{d_attn} and value vectors v_t ∈ ℝ^{d_value}. The inner products q⊺k_t are interpreted as the degree to which token t is important for predicting the current token – they are used to derive a distribution over the context tokens, which is then used to combine the value vectors. An intuitive explanation of how this achieves attention can be found at [Ala18, Ala19]. The precise algorithm is given in Algorithm 3.
There are many ways the basic attention mechanism is used in transformers. We list some of the most common variants below.
It will be useful to define the softmax function for matrix arguments, as well as a Mask matrix:
softmax(A)[t_z, t_x] := exp(A[t_z, t_x]) / Σ_{t'} exp(A[t', t_x]),   (2)
Mask ∈ {0, 1}^{ℓ_z × ℓ_x}, e.g. Mask ≡ 1 (no masking) or Mask[t_z, t_x] := [[t_z ≤ t_x]] (unidirectional masking).   (3)
Bidirectional / unmasked self-attention.
Given a sequence of token representations, this variant applies attention to each token, treating all tokens in the sequence as the context. See Algorithm 4, called with token sequence Z = X and no masking (Mask ≡ 1).
Unidirectional / masked self-attention.
Given a sequence of token representations, this variant applies attention to each token, treating all preceding tokens (including itself) as the context. Future tokens are masked out, so this causal, autoregressive version can be used for online prediction. See Algorithm 4, called with token sequence Z = X and Mask[t_z, t_x] = [[t_z ≤ t_x]]. For this Mask, the output Ṽ[:, t] only depends on X[:, 1:t], hence it can be used to predict X[t+1].
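The unidirectional mask has a simple lower-triangular-like structure; a minimal sketch (1-based positions as in the paper's indexing convention):

```python
def causal_mask(l):
    """Mask[t_z][t_x] = 1 iff t_z <= t_x: when predicting at position t_x,
    only context positions up to and including t_x may be attended to."""
    return [[1 if tz <= tx else 0 for tx in range(1, l + 1)]
            for tz in range(1, l + 1)]
```

Entries with Mask = 0 are excluded from the softmax (in practice by adding −∞ to the corresponding scores before normalizing).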
Cross-attention.
Given two sequences of token representations (often in the context of a sequence-to-sequence task), this variant applies attention to each token of the primary token sequence X, treating the second token sequence Z as the context. See Algorithm 4, called with Mask ≡ 1. While the output Ṽ and input X sequences have the same length ℓ_x, the context sequence Z can have a different length ℓ_z.
Multi-head attention.
The attention algorithm presented so far (Algorithm 4) describes the operation of a single attention head. In practice, transformers run multiple attention heads (with separate learnable parameters) in parallel and combine their outputs; this is called multi-head attention; see Algorithm 5.
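The combine step can be sketched as follows (a minimal per-token illustration; a full implementation would also apply the per-head query/key/value projections and run the heads over whole sequences):

```python
def multi_head_combine(head_outputs, W_o, b_o):
    """Combine step of multi-head attention: stack the H head outputs for one
    token into a single vector and apply the learned output projection
    y = W_o . concat + b_o."""
    concat = [v for head in head_outputs for v in head]  # stack H head outputs
    return [sum(w * c for w, c in zip(row, concat)) + b
            for row, b in zip(W_o, b_o)]
```

Because the heads have separate parameters, each can learn to attend to a different aspect of the context (e.g. syntax vs. nearby positions), and the output projection mixes them back into one representation.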
Layer normalisation.
Layer normalisation explicitly controls the mean and variance of individual neural network activations; the pseudocode is given in Algorithm 6. Some transformers use a simpler and more computationally efficient version of layer normalization setting m̂ = β = 0, called root mean square layer normalisation, or RMSnorm.
Unembedding.
The unembedding learns to convert a vector representation of a token and its context into a distribution over the vocabulary elements; see Algorithm 7. The algorithm describes an independently learned unembedding matrix, but note that sometimes the unembedding matrix is instead fixed to be the transpose of the embedding matrix.
6 Transformer Architectures
This section presents a few prominent transformer architectures, based on attention Algorithms 5 and 4 and using normalization Algorithm 6, in roughly historical order:

EDT [VSP17]. The original sequence-to-sequence / Encoder-Decoder Transformer, Algorithms 8, 12 and LABEL:algo:EDTransformer.

BERT [DCLT19], which is an instance of an encoder-only transformer (encoder-only means that it is derived from the encoder-decoder architecture by dropping the decoder part), Algorithms 9 and LABEL:algo:ETransformer.

GPT [RWC19, BMR20], which is an instance of a decoder-only transformer, Algorithms 10, 11 and LABEL:algo:DTransformer.
While the main architectural difference between BERT and GPT is in attention masking, they also differ in a number of less important ways: e.g. they use different activation functions and the layernorms are positioned differently. We included these differences in the pseudocode to stay faithful to the original algorithms, but note that different transformer architectures may adopt these selectively.
To simplify notation, we denote by W the entire set of parameters (query, key, value and output linear projections) required by a multi-head attention layer:
W := {W_q^h, b_q^h, W_k^h, b_k^h, W_v^h, b_v^h : h ∈ [H]} ∪ {W_o, b_o}.   (4)
Encoder-decoder / sequence-to-sequence transformer [VSP17].
This is the very first transformer. It was originally used for sequence-to-sequence tasks (machine translation), which is why it is more complicated than its successors.
The idea behind the architecture is as follows: First, the context sequence is encoded using bidirectional multihead attention. The output of this ‘encoder’ part of the network is a vector representation of each context token, taking into account the entire context sequence. Second, the primary sequence is encoded. Each token in the primary sequence is allowed to use information from the encoded context sequence, as well as primary sequence tokens that precede it. See LABEL:algo:EDTransformer for more details.
Encoder-only transformer: BERT [DCLT19].
BERT is a bidirectional transformer trained on the task of masked language modelling. Given a piece of text with some tokens masked out, the goal is to correctly recover the masked-out tokens. The original use of BERT was to learn generally useful text representations, which could then be adapted for various downstream NLP tasks. The masking is not performed via the Mask parameter but differently: during training, each input token is replaced with probability p_mask by a dummy token mask_token, and evaluation is based on the reconstruction probability of these knocked-out tokens (see Algorithm 9).
The BERT architecture resembles the encoder part of the seq2seq transformer (hence ‘encoder-only’). It is described in detail in LABEL:algo:ETransformer. It uses the GELU nonlinearity instead of ReLU:
GELU(x) = x · P(X ≤ x), where X ~ N(0, 1).   (5)
(When called with vector or matrix arguments, GELU is applied elementwise.)
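Since the standard normal CDF can be written via the error function, GELU has a short exact implementation (a sketch using only the standard library; production code often uses a tanh approximation instead):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    the CDF of the standard normal distribution."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, GELU is smooth and slightly negative for small negative inputs, approaching 0 for large negative x and x for large positive x.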
Decoder-only transformers: GPT-2 [RWC19], GPT-3 [BMR20], Gopher [RBC21].
GPT-2 and GPT-3 are large language models developed by OpenAI, and Gopher is a large language model developed by DeepMind. They all have similar architectures and are trained by autoregressive language modelling: given an incomplete sentence or paragraph, the goal is to predict the next token.
The main difference from BERT is that GPT/Gopher use unidirectional attention instead of bidirectional attention; they also apply layernorms in slightly different order.
See LABEL:algo:DTransformer for the pseudocode of GPT-2. GPT-3 is identical except larger, and it replaces dense attention in Line 6 by sparse attention, i.e. each token only uses a subset of the full context.
Gopher also deviates only slightly from the GPT-2 architecture: it replaces layer norm in lines 5, 7 and 10 by RMSnorm (m̂ = β = 0), and it uses different positional embeddings.
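The relationship between the two normalisations can be sketched per activation vector (a minimal illustration, not the paper's Algorithm 6; eps is the usual numerical-stability constant):

```python
import math

def layer_norm(e, gamma, beta, eps=1e-5):
    """Layer norm: rescale activations to zero mean and unit variance,
    then apply the learned scale gamma and offset beta elementwise."""
    m = sum(e) / len(e)
    var = sum((v - m) ** 2 for v in e) / len(e)
    return [g * (v - m) / math.sqrt(var + eps) + b
            for v, g, b in zip(e, gamma, beta)]

def rms_norm(e, gamma, eps=1e-5):
    """RMSnorm: layer norm with the mean estimate and offset fixed to zero;
    only divides by the root mean square of the activations."""
    ms = sum(v * v for v in e) / len(e)
    return [g * v / math.sqrt(ms + eps) for v, g in zip(e, gamma)]
```

RMSnorm skips the mean subtraction and the offset, saving a pass over the activations while preserving the rescaling effect.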
Multi-domain decoder-only transformer: Gato [RZP22].
Gato is a multimodal multitask transformer built by DeepMind. It is a single neural network that can play Atari, navigate 3D environments, control a robotic arm, caption images, have conversations, and more.
Under the hood, each modality is converted into a sequence prediction problem by a separate tokenization and embedding method; for example, images are divided into non-overlapping patches, ordered in raster order (left-to-right, top-to-bottom) and processed by a ResNet block to obtain a vector representation.
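The patching step can be sketched as follows (a minimal illustration of non-overlapping raster-order patch extraction; the subsequent ResNet embedding is omitted, and the patch size p is an assumption of the sketch, not taken from the source):

```python
def patchify(image, p):
    """Split an H x W image (a list of rows) into non-overlapping p x p
    patches, emitted in raster order: left-to-right, then top-to-bottom.
    Assumes H and W are multiples of p."""
    H, W = len(image), len(image[0])
    patches = []
    for top in range(0, H, p):          # top-to-bottom
        for left in range(0, W, p):     # left-to-right within each band
            patches.append([row[left:left + p]
                            for row in image[top:top + p]])
    return patches
```

Each resulting patch would then be embedded into a vector, turning the image into a token sequence the decoder-only transformer can consume.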
The actual Gato architecture is then a decoder-only transformer like the one in LABEL:algo:DTransformer, but with Line 2 replaced by modality-specific embedding code.
7 Transformer Training and Inference
This section lists the pseudocode for various algorithms for training and using transformers:

EDTraining() Algorithm 8 shows how to train a sequence-to-sequence transformer (the original Transformer [VSP17]).

ETraining() Algorithm 9 shows how to train a transformer on the task of masked language modelling (like BERT [DCLT19]).

DTraining() Algorithm 10 shows how to train a transformer on the task of next-token prediction (like GPTx [BMR20] and Gopher [RBC21]).

DInference() Algorithm 11 shows how to prompt a transformer trained on next-token prediction (like GPTx [BMR20]). The temperature parameter τ interpolates between most likely continuation (τ = 0), faithful sampling (τ = 1), and uniform sampling (τ = ∞).

EDInference() Algorithm 12 shows how to use a sequence-to-sequence transformer for prediction.
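The role of the temperature parameter at inference can be illustrated with a small sampler (a sketch, not the paper's DInference pseudocode; `logits` stands in for the transformer's unnormalised output scores, and the seeded rng is only for reproducibility):

```python
import math
import random

def sample_next_token(logits, tau, rng=random.Random(0)):
    """Sample an index from softmax(logits / tau).
    tau -> 0 gives the most likely continuation (argmax),
    tau = 1 gives faithful sampling, tau -> infinity approaches uniform."""
    if tau == 0:                        # greedy / most likely continuation
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(l / tau for l in logits)    # subtract max for numerical stability
    w = [math.exp(l / tau - m) for l in logits]
    z = sum(w)
    r, acc = rng.random() * z, 0.0      # inverse-CDF sampling
    for i, wi in enumerate(w):
        acc += wi
        if r <= acc:
            return i
    return len(logits) - 1
```

Lowering tau sharpens the distribution toward the highest-scoring token; raising it flattens the distribution toward uniform, trading plausibility for diversity.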
Gradient descent.
The described training Algorithms 8 to 10 use Stochastic Gradient Descent (SGD) to minimize the log loss (aka cross-entropy) as the update rule. Computation of the gradient is done via automatic differentiation tools; see [BPRS18, Table 5]. In practice, vanilla SGD is usually replaced by some more refined variation such as RMSProp or AdaGrad or others [Rud16]. Adam [KB15] is used most often these days.
8 Practical Considerations
While the vanilla transformers provided here may work in practice, a variety of “tricks” have been developed over the years to improve the performance of deep neural networks in general and transformers in particular [LWLQ21]:

Architecture: sparse layers, weight sharing (besides attention).

Training: improved optimizers, minibatches, batch normalization, learning rate scheduling, weight initialization, pre-training, ensembling, multi-task and adversarial training (besides layer normalization) [Sut15].
Inference: scratchpad prompting, few-shot prompting, chain of thought, majority voting [LAD22].

Others.
Appendix A References
 [Ala18] Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/, 2018.
 [Ala19] Jay Alammar. The Illustrated GPT-2 (Visualizing Transformer Language Models). http://jalammar.github.io/illustrated-gpt2/, 2019.
 [Bel21] Fabrice Bellard. NNCP v2: Lossless Data Compression with Transformer. https://bellard.org/libnc/gpt2tc.html, 2021.
 [BFT17] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. NeurIPS, 2017.
 [BMR20] Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. NeurIPS, 2020.
 [BPRS18] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic Differentiation in Machine Learning: A Survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
 [DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ACL, 2019.
 [EGKZ21] Benjamin L. Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive Biases and Variable Creation in Self-Attention Mechanisms. arXiv:2110.10090 [cs, stat], October 2021.
 [Elh21] Nelson Elhage. A Mathematical Framework for Transformer Circuits. https://transformer-circuits.pub/2021/framework/index.html, 2021.
 [EMK21] Yonathan Efroni, Dipendra Misra, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Provable RL with Exogenous Distractors via Multistep Inverse Dynamics, March 2021.
 [FGW21] Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, 2021.
 [Gag94] Philip Gage. A new algorithm for data compression. Dr. Dobbs / C Users Journal, 12(2):23–38, 1994.
 [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 2018.
 [KB15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 [Ker21] Jonathan Kernes. Master Positional Encoding: Part I. https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3, March 2021.
 [LAD22] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving Quantitative Reasoning Problems with Language Models. arXiv:2206.14858 [cs], June 2022.
 [Lem21] Chris Lemke. Data preprocessing in NLP. https://towardsdatascience.com/data-preprocessing-in-nlp-c371d53ba3e0, July 2021.
 [LWLQ21] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A Survey of Transformers, June 2021.
 [MBM20] Reza Moradi, Reza Berangi, and Behrouz Minaei. A survey of regularization strategies for deep models. Artificial Intelligence Review, 53(6):3947–3986, August 2020.
 [MKS13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, December 2013.
 [MPCB14] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. NeurIPS, 2014.
 [PH22] M. Phuong and M. Hutter. Formal algorithms for transformers. Technical report, DeepMind, London, UK, 2022. LaTeX source available at http://arXiv.org
 [RBC21] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv:2112.11446, 2021.
 [Rud16] Sebastian Ruder. An overview of gradient descent optimization algorithms. https://ruder.io/optimizing-gradient-descent/, January 2016.
 [RWC19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.
 [RZP22] Scott Reed, Konrad Żołna, Emilio Parisotto, et al. A generalist agent. arXiv:2205.06175, 2022.
 [SBB18] Richard S. Sutton, Andrew G. Barto, and Francis Bach. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, second edition, November 2018.
 [SGBK21] Eren Sezener, Agnieszka GrabskaBarwińska, Dimitar Kostadinov, Maxime Beau, Sanjukta Krishnagopal, David Budden, Marcus Hutter, Joel Veness, Matthew Botvinick, Claudia Clopath, Michael Häusser, and Peter E. Latham. A rapid and efficient learning rule for biological neural circuits. Technical report, DeepMind, London, UK, 2021.
 [SHB16] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725. Association for Computational Linguistics (ACL), 2016.
 [Sut15] Ilya Sutskever. A Brief Overview of Deep Learning. http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html, January 2015.
 [TZ22] Yingjie Tian and Yuqi Zhang. A comprehensive survey on regularization strategies in machine learning. Information Fusion, 80:146–166, April 2022.
 [VSP17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
Appendix B List of Notation
Symbol | Type | Explanation
[N] | := {1, 2, ..., N} | set of integers
i, j | ∈ ℕ | generic integer indices
V | finite set | vocabulary
N_V | := |V| | vocabulary size
V* | set | set of token sequences; elements include e.g. sentences or documents
ℓ_max | ∈ ℕ | maximum sequence length
ℓ, ℓ_x, ℓ_z | ∈ [ℓ_max] | length of token sequence
t | ∈ [ℓ] | index of token in a sequence
d_e, d_attn, d_mlp, ... | ∈ ℕ | dimension of various vectors
x ≡ x[1:ℓ_x] | ∈ V* | primary token sequence
z ≡ z[1:ℓ_z] | ∈ V* | context token sequence
M[i, j] | ∈ ℝ | entry of matrix M
M[i, :] | row vector | i-th row of matrix M
M[:, j] | column vector | j-th column of matrix M
e | ∈ ℝ^{d_e} | vector representation / embedding of a token
X | ∈ ℝ^{d_e×ℓ_x} | encoded primary token sequence
Z | ∈ ℝ^{d_e×ℓ_z} | encoded context token sequence
Mask | ∈ ℝ^{ℓ_z×ℓ_x} | masking matrix; it determines the attention context for each token
L, L_enc, L_dec | ∈ ℕ | number of network (encoder, decoder) layers
l | ∈ [L] | index of network layer
H | ∈ ℕ | number of attention heads
h | ∈ [H] | index of attention head
N_data | ∈ ℕ | (i.i.d.) sample size
n | ∈ [N_data] | index of sample sequence
η | ∈ (0, ∞) | learning rate
τ | ∈ (0, ∞) | temperature; it controls the diversity–plausibility trade-off at inference
W_e | ∈ ℝ^{d_e×N_V} | token embedding matrix
W_p | ∈ ℝ^{d_e×ℓ_max} | positional embedding matrix
W_u | ∈ ℝ^{N_V×d_e} | unembedding matrix
W_q | matrix | query weight matrix
b_q | vector | query bias
W_k | matrix | key weight matrix
b_k | vector | key bias
W_v | matrix | value weight matrix
b_v | vector | value bias
W_qkv | collection | collection of the above parameters of a single-head attention layer
W_o | matrix | output weight matrix
b_o | vector | output bias
W | collection | collection of the above parameters of a multi-head attention layer, see eq. 4
W_mlp | matrix | weight matrix corresponding to an MLP layer in a Transformer
b_mlp | vector | bias corresponding to an MLP layer in a Transformer
γ | ∈ ℝ^{d_e} | layer-norm learnable scale parameter
β | ∈ ℝ^{d_e} | layer-norm learnable offset parameter
θ | collection | collection of all learnable / learned Transformer parameters