
Formal Algorithms for Transformers

by Mary Phuong, et al.

This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (*not* results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models. The reader is assumed to be familiar with basic ML terminology and simpler neural network architectures such as MLPs.




1 Introduction

Transformers are deep feed-forward artificial neural networks with a (self)attention mechanism. They have been tremendously successful in natural language processing tasks and other domains. Since their inception 5 years ago [VSP17], many variants have been suggested [LWLQ21]. Descriptions are usually graphical, verbal, partial, or incremental. Despite their popularity, it seems no pseudocode has ever been published for any variant. Contrast this to other fields of computer science, even to the “cousin” discipline of reinforcement learning [MKS13, SBB18, EMK21].

This report intends to rectify the situation for Transformers. It aims to be a self-contained, complete, precise and compact overview of transformer architectures and formal algorithms (but not results). It covers what Transformers are (Section 6), how they are trained (Section 7), what they’re used for (Section 3), their key architectural components (Section 5), tokenization (Section 4), and a preview of practical considerations (Section 8) and the most prominent models.

The essentially complete pseudocode is about 50 lines, compared to thousands of lines of actual real source code. We believe these formal algorithms will be useful for theoreticians who require compact, complete, and precise formulations and for experimental researchers interested in implementing a Transformer from scratch, and we encourage authors to augment their paper or text book with formal Transformer algorithms (Section 2).

The reader is assumed to be familiar with basic ML terminology and simpler neural network architectures such as MLPs.

In short, a (formally inclined) reader, upon understanding the contents of this document, will have a solid grasp of transformers: they will be ready to read and contribute to the literature on the topic as well as implement their own Transformer using the pseudocode as templates.

2 Motivation

The true story above the introduction describes quite well the feeling we have when browsing many Deep Learning (DL) papers: unable to figure out the algorithmic suggestions exactly. For practitioners, the papers may be sufficiently detailed, but the precision needed by theoreticians is usually higher. For reasons not entirely clear to us, the DL community seems shy of providing pseudocode for their neural network models. Below we describe the state of the art (SOTA) in DL paper writing and argue for the value of formal algorithms. The reader already convinced of their merit can skip this section without loss.

The lack of scientific precision and detail in DL publications.

Deep Learning has been tremendously successful in the last 5 to 10 years with thousands of papers published every year. Many describe only informally how they change a previous model. Some 100+ page papers contain only a few lines of prose informally describing the model [RBC21]. At best there are some high-level diagrams. No pseudocode. No equations. No reference to a precise explanation of the model. One may argue that most DL models are minor variations of a few core architectures, such as the Transformer [VSP17], so a reference augmented by a description of the changes should suffice. This would be true if (a) the changes were described precisely, (b) the reference architecture had been described precisely elsewhere, and (c) a reference were given to this description. Some if not all three are lacking in most DL papers. To the best of our knowledge no-one has even provided pseudocode for the famous Transformer and its encoder/decoder-only variations.

Interfacing algorithms.

Equally important are proper explanations of how these networks are trained and used, but sometimes it is not even clear what the inputs and outputs and potential side-effects of the described model are. Of course someone experienced would usually be able to correctly guess, but this is not a particularly scientific approach. The experimental section in publications often does not explain what is actually fed into the algorithms and how. If there is some explanation in the methods section, it is often disconnected from what is described in the experimental section, possibly due to different authors writing the different sections. The core algorithms for the models should be accompanied by the wrapper algorithms that call them, e.g. (pre)training, fine-tuning, prompting, inference, deployment. Sometimes this is simple, but sometimes the magic happens there. In any case, if these things are not formally stated they remain unclear. Again, if the setup is standard and has been formally explained elsewhere, a simple reference will do.

Source code vs pseudocode.

Providing open source code is very useful, but not a proper substitute for formal algorithms. There is a massive difference between a (partial) Python dump and well-crafted pseudocode. A lot of abstraction and clean-up is necessary: remove boilerplate code, use mostly single-letter variable names, replace code by math expressions wherever possible, e.g. replace loops by sums, remove (some) optimizations, etc. A well-crafted pseudocode is often less than a page and still essentially complete, compared to often thousands of lines of real source code. This is hard work no-one seems to be willing to do. Of course a forward process of first designing algorithms and writing up pseudocode on paper, and then implementing them, is fine too, but few DL practitioners seem to work that way.

Examples of good neural network pseudocode and mathematics and explanations.

Multi-Layer Perceptrons (MLPs) are usually well-described in many papers, e.g. [MPCB14, BFT17, JGH18], though also without pseudocode. For a rare text-book example of pseudocode for a non-trivial neural network architecture, see Algorithm S2 of [SGBK21], which constitutes a complete, i.e. essentially executable, pseudocode of just 25 lines based on a 350-line Python Colab toy implementation, which itself is based on a proper 1000+ line implementation.

This work aims to do the same for Transformers: the whole decoder-only Transformer GPT (the DTransformer pseudocode), based on attention (Algorithms 4 and 5) and normalization (Algorithm 6), including training (Algorithm 10) and prompting and inference (Algorithm 11), is altogether less than 50 lines of pseudocode, compared to e.g. the 2000-line self-contained C implementation [Bel21].

[Ala19] is a great blog-post explaining Transformers and [EGKZ21] describes the attention mechanism to sufficient mathematical precision to allow proving properties about it, but neither provides pseudocode. [Elh21] is an attempt to understand Transformers by reverse-engineering the computations they perform and interpreting them as circuits.


But does anyone actually need pseudocode, and what would it be useful for (we sometimes get asked)? We find the absence of pseudocode in DL and this question quite perplexing, but apparently it requires answering. Providing such pseudocode can be useful for many purposes:

  • They can be used as templates and adapted to precisely describe future variations, and therewith set a new standard in DL publishing. We explicitly encourage the reader to copy and adapt them to their needs and cite the original as “adapted from [PH22]”.

  • Having all that matters on one page in front of you makes it easier to develop new variations compared to reading prose or scrolling through 1000s of lines of actual code.

  • They can be used as a basis for new implementations from scratch, e.g. in different programming languages, without having to wade through and reverse-engineer existing idiosyncratic real source code.

  • They may establish notational conventions, which ease communication and the reading of future variations.

  • The process of converting source code into pseudocode can exhibit implementation errors (as it e.g. did in [SGBK21]).

  • Theoreticians need compact, complete, and precise representations for reasoning and ultimately proving properties about algorithms. They are often unwilling or unable to reverse engineer code, or guess the meaning of words or fancy diagrams.

With this motivation in mind, the following five sections formally describe all aspects of transformer architectures, training, and inference.

3 Transformers and Typical Tasks

Transformers are neural network models that excel at natural language processing, or more generally at modelling sequential data. Two common types of tasks they are used for are sequence modelling and sequence-to-sequence prediction.


Let V denote a finite set, called a vocabulary, often identified with [N_V] := {1, …, N_V}. The elements could be words or letters, but typically are sub-words, called tokens. Let x ≡ x[1:ℓ] ≡ x[1] x[2] … x[ℓ] ∈ V* be a sequence of tokens, e.g. a sentence or a paragraph or a document. Unlike in Python, we use arrays starting from 1, and x[1:ℓ] includes x[ℓ]. For a matrix M ∈ ℝ^{d×d′}, we write M[i, :] for the i-th row and M[:, j] for the j-th column. We use the matrix × column-vector convention more common in mathematics, compared to the default row-vector × matrix convention in the transformer literature, i.e. our matrices are transposed. See Appendix B for a complete list of notation.


The predominant paradigm in machine learning is (still) learning from independent and identically distributed (i.i.d.) data. Even for sequence modelling, this tradition is upheld for practical reasons. The training data may naturally be a collection of (independent) articles, but even then, some may exceed the maximal context length ℓ_max that transformers can handle. In this case, an article is crudely broken into shorter chunks of length ≤ ℓ_max.

Sequence modelling (DTransformer).

Given a vocabulary V, let x_n ∈ V* for n ∈ [N_data] be a dataset of sequences (imagined to be) sampled i.i.d. from some distribution P over V*. The goal is to learn an estimate P̂ of the distribution P(x). In practice, the distribution estimate is often decomposed via the chain rule as

P̂(x) = P̂_θ(x[1]) · P̂_θ(x[2] | x[1]) ⋯ P̂_θ(x[ℓ] | x[1:ℓ−1]),

where θ consists of all neural network parameters to be learned. The goal then becomes learning a distribution over a single token x[t] given its preceding tokens x[1:t−1] as context.

Examples include e.g. language modelling, RL policy distillation, or music generation.
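To make the chain-rule decomposition concrete, here is a minimal Python sketch that computes a sequence probability as a product of per-token conditionals. The uniform conditional model is a hypothetical stand-in for a trained network's conditional distribution.

```python
# Toy "model": a uniform conditional distribution over a 3-token vocabulary.
# (A hypothetical stand-in for a trained network's conditional P(x[t] | x[1:t-1]).)
N_V = 3

def p_cond(token, context):
    return 1.0 / N_V

def sequence_prob(x):
    """P(x) = prod_t P(x[t] | x[1:t-1]), the chain-rule decomposition."""
    p = 1.0
    for t in range(len(x)):
        p *= p_cond(x[t], x[:t])
    return p

print(sequence_prob([0, 2, 1]))  # (1/3)^3
```

A trained transformer replaces `p_cond` by the network's predicted next-token distribution, evaluated at the observed token.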

Sequence-to-sequence (seq2seq) prediction (EDTransformer).

Given a vocabulary V and an i.i.d. dataset of sequence pairs (z_n, x_n) ∼ P, where P is a distribution over V* × V*, learn an estimate of the conditional distribution P(x | z). In practice, the conditional distribution estimate is often decomposed as

P̂(x | z) = P̂_θ(x[1] | z) · P̂_θ(x[2] | x[1], z) ⋯ P̂_θ(x[ℓ] | x[1:ℓ−1], z).

Examples include translation (z a sentence in English, x the same sentence in German), question answering (z a question, x the corresponding answer), and text-to-speech (z a piece of text, x a voice recording of someone reading the text).

Classification (ETransformer).

Given a vocabulary V and a set of classes [N_C], let (x_n, c_n) ∈ V* × [N_C] for n ∈ [N_data] be an i.i.d. dataset of sequence–class pairs sampled from P(x, c). The goal in classification is to learn an estimate of the conditional distribution P(c | x).

Examples include e.g. sentiment classification, spam filtering, toxicity classification.

4 Tokenization: How Text is Represented

In the context of natural language tasks, tokenization refers to how a piece of text such as “My grandma makes the best apple pie.” is represented as a sequence of vocabulary elements (called tokens).

Character-level tokenization.

One possible choice is to let V be the English alphabet (plus punctuation). In the example above, we’d get a sequence of length 36: [‘M’, ‘y’, ‘ ’, …]. Character-level tokenization tends to yield very long sequences.

Word-level tokenization.

Another choice would be to let V consist of all English words (plus punctuation). In the example above, we’d get a sequence of length 7: [‘My ’, ‘grandma ’, ‘makes ’, …]. Word-level tokenization tends to require a very large vocabulary and cannot deal with new words at test time.

Subword tokenization.

This is the method used in practice nowadays: V is a set of commonly occurring word segments like ‘cious’, ‘ing’, ‘pre’. Common words like ‘is ’ are often a separate token, and single characters are also included in V to ensure all words are expressible.

There are in fact many ways to do subword tokenization. One of the simplest and most successful ones is Byte Pair Encoding [Gag94, SHB16], used in GPT-2 [RWC19].
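As an illustration, here is a toy Python sketch of the core BPE idea: repeatedly merge the most frequent adjacent symbol pair. Real tokenizers such as GPT-2's operate on bytes and handle many corner cases omitted here.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair-encoding sketch: repeatedly merge the most frequent
    adjacent symbol pair across the corpus. Illustrative only."""
    corpus = [list(w) for w in words]   # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for w in corpus:                # replace every occurrence of the pair
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(bpe_merges(["low", "lower", "lowest"], 2))
```

On this tiny corpus the second merge produces the segment ‘low’, showing how frequent substrings become single tokens.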


Final vocabulary and text representation.

Given a choice of tokenization / vocabulary, each vocabulary element is assigned a unique index in {1, …, N_V − 3}. A number of special tokens are then added to the vocabulary. The number of special tokens varies, and here we will consider three: mask_token := N_V − 2, used in masked language modelling (see Algorithm 9); bos_token := N_V − 1, used for representing the beginning of a sequence; and eos_token := N_V, used for representing the end of a sequence. The complete vocabulary has N_V elements.

A piece of text is represented as a sequence of indices (called token IDs) corresponding to its (sub)words, preceded by bos_token and followed by eos_token.

5 Architectural Components

The following are the neural network building blocks (functions with learnable parameters) from which transformers are made. Full architectures featuring these building blocks are presented in the next section. (By a slight abuse of notation, we identify V with the set [N_V] = {1, 2, …, N_V}.)

Token embedding.

The token embedding learns to represent each vocabulary element as a vector in ℝ^{d_e}; see Algorithm 1.

Input: v ∈ V ≅ [N_V], a token ID.
Output: e ∈ ℝ^{d_e}, the vector representation of the token.
Parameters: W_e ∈ ℝ^{d_e×N_V}, the token embedding matrix.
1 return e = W_e[:, v]
Algorithm 1 Token embedding.

Positional embedding.

The positional embedding learns to represent a token’s position in a sequence as a vector in ℝ^{d_e}. For example, the position of the first token in a sequence is represented by a (learned) vector W_p[:, 1], the position of the second token by another (learned) vector W_p[:, 2], etc. The purpose of the positional embedding is to allow a Transformer to make sense of word ordering; in its absence the representation would be permutation invariant and the model would perceive sequences as “bags of words” instead.

Learned positional embeddings require that the input sequence length is at most some fixed number ℓ_max (the size of the learned positional embedding matrix must be finite and fixed in advance of training). An intuitive explanation of how this works can be found at [Ala18]. For pseudocode, see Algorithm 2.

Not all transformers make use of learned positional embeddings; some use a hard-coded mapping instead [Ker21]. Such hard-coded positional embeddings can (theoretically) handle arbitrarily long sequences. The original Transformer [VSP17] uses

W_p[2i−1, t] = sin(t / ℓ_max^{2i/d_e}),  W_p[2i, t] = cos(t / ℓ_max^{2i/d_e})

for 0 < i ≤ d_e/2.

Input: ℓ ∈ [ℓ_max], position of a token in the sequence.
Output: e_p ∈ ℝ^{d_e}, the vector representation of the position.
Parameters: W_p ∈ ℝ^{d_e×ℓ_max}, the positional embedding matrix.
1 return e_p = W_p[:, ℓ]
Algorithm 2 Positional embedding.

The positional embedding of a token is usually added to the token embedding to form a token’s initial embedding. For the t-th token of a sequence x, the embedding is

e = W_e[:, x[t]] + W_p[:, t].
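The combination of Algorithms 1 and 2 can be sketched in a few lines of NumPy. The dimensions and the random token embedding matrix below are illustrative stand-ins for learned parameters; the positional part uses the hard-coded sinusoidal mapping of the original Transformer.

```python
import numpy as np

d_e, N_V, l_max = 8, 100, 512
rng = np.random.default_rng(0)

# Algorithm 1: learned token embedding, one column per vocabulary element
# (random stand-in for trained weights).
W_e = rng.standard_normal((d_e, N_V))

def positional_embedding(t):
    """Hard-coded sinusoidal positional embedding (original Transformer)."""
    e_p = np.empty(d_e)
    for i in range(1, d_e // 2 + 1):
        angle = t / l_max ** (2 * i / d_e)
        e_p[2 * i - 2] = np.sin(angle)   # entry 2i-1 in 1-based indexing
        e_p[2 * i - 1] = np.cos(angle)   # entry 2i
    return e_p

def initial_embedding(x, t):
    """Initial embedding of the t-th token (1-based) of token-ID sequence x."""
    return W_e[:, x[t - 1]] + positional_embedding(t)

e = initial_embedding([5, 17, 3], 2)
assert e.shape == (d_e,)
```

A learned positional embedding would simply replace `positional_embedding(t)` by a column lookup `W_p[:, t - 1]` into a second trained matrix.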

Attention.

Attention is the main architectural component of transformers. It enables a neural network to make use of contextual information (e.g. preceding text or the surrounding text) for predicting the current token.

On a high level, attention works as follows: the token currently being predicted is mapped to a query vector q, and the tokens in the context are mapped to key vectors k_t and value vectors v_t. The inner products q^⊤k_t are interpreted as the degree to which token t is important for predicting the current token; passed through a softmax, they yield a distribution over the context tokens, which is then used to combine the value vectors. An intuitive explanation of how this achieves attention can be found at [Ala18, Ala19]. The precise algorithm is given in Algorithm 3.

Input: e ∈ ℝ^{d_in}, vector representation of the current token.
Input: e_t ∈ ℝ^{d_in}, vector representations of context tokens t ∈ [T].
Output: ṽ ∈ ℝ^{d_out}, vector representation of the token and context combined.
Parameters: W_q, W_k ∈ ℝ^{d_attn×d_in}, b_q, b_k ∈ ℝ^{d_attn}, the query and key linear projections.
Parameters: W_v ∈ ℝ^{d_out×d_in}, b_v ∈ ℝ^{d_out}, the value linear projection.
1 q ← W_q e + b_q
2 k_t ← W_k e_t + b_k for t ∈ [T]
3 v_t ← W_v e_t + b_v for t ∈ [T]
4 α_t ← exp(q^⊤k_t / √d_attn) / Σ_u exp(q^⊤k_u / √d_attn)
5 return ṽ = Σ_{t=1}^{T} α_t v_t
Algorithm 3 Basic single-query attention.
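A minimal NumPy sketch of Algorithm 3, with random stand-ins for the learned projections:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_attn, d_out, T = 8, 4, 8, 5

# Parameters of Algorithm 3 (random stand-ins for learned weights; zero biases).
W_q, b_q = rng.standard_normal((d_attn, d_in)), np.zeros(d_attn)
W_k, b_k = rng.standard_normal((d_attn, d_in)), np.zeros(d_attn)
W_v, b_v = rng.standard_normal((d_out, d_in)), np.zeros(d_out)

def single_query_attention(e, context):
    q = W_q @ e + b_q                                    # query for current token
    ks = np.array([W_k @ e_t + b_k for e_t in context])  # keys,   shape (T, d_attn)
    vs = np.array([W_v @ e_t + b_v for e_t in context])  # values, shape (T, d_out)
    scores = ks @ q / np.sqrt(d_attn)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                 # softmax over context tokens
    return alpha @ vs                                    # weighted sum of values

e = rng.standard_normal(d_in)
context = rng.standard_normal((T, d_in))
v_tilde = single_query_attention(e, context)
assert v_tilde.shape == (d_out,)
```

Subtracting `scores.max()` before exponentiating is a standard numerical-stability trick and leaves the softmax unchanged.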

There are many ways the basic attention mechanism is used in transformers. We list some of the most common variants below.

It will be useful to define the softmax function for matrix arguments, applied column-wise,

softmax(S)[t_z, t_x] = exp(S[t_z, t_x]) / Σ_t exp(S[t, t_x]),

as well as a mask matrix Mask ∈ {0,1}^{ℓ_z×ℓ_x}, whose entry Mask[t_z, t_x] indicates whether position t_x may attend to position t_z:

/* Computes a single (masked) self- or cross-attention head. */
Input: X ∈ ℝ^{d_x×ℓ_x}, Z ∈ ℝ^{d_z×ℓ_z}, vector representations of primary and context sequence.
Output: Ṽ ∈ ℝ^{d_out×ℓ_x}, updated representations of tokens in X, folding in information from tokens in Z.
Parameters: W_qkv consisting of: W_q ∈ ℝ^{d_attn×d_x}, b_q ∈ ℝ^{d_attn}; W_k ∈ ℝ^{d_attn×d_z}, b_k ∈ ℝ^{d_attn}; W_v ∈ ℝ^{d_out×d_z}, b_v ∈ ℝ^{d_out}.
Hyperparameters: Mask ∈ {0,1}^{ℓ_z×ℓ_x}
1 Q ← W_q X + b_q 1^⊤  [[Query ∈ ℝ^{d_attn×ℓ_x}]]
2 K ← W_k Z + b_k 1^⊤  [[Key ∈ ℝ^{d_attn×ℓ_z}]]
3 V ← W_v Z + b_v 1^⊤  [[Value ∈ ℝ^{d_out×ℓ_z}]]
4 S ← K^⊤ Q  [[Score ∈ ℝ^{ℓ_z×ℓ_x}]]
5 For each t_z, t_x: if not Mask[t_z, t_x] then S[t_z, t_x] ← −∞
6 return Ṽ = V · softmax(S / √d_attn)
Algorithm 4 Attention
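Algorithm 4 can be sketched in NumPy as follows; the mask is applied by setting disallowed scores to −∞ before the column-wise softmax. All weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax, as used in Algorithm 4."""
    A = A - A.max(axis=0, keepdims=True)   # stability shift per column
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def attention(X, Z, W_q, b_q, W_k, b_k, W_v, b_v, mask):
    """Single (masked) self- or cross-attention head over whole sequences."""
    d_attn = W_q.shape[0]
    Q = W_q @ X + b_q[:, None]         # queries, shape (d_attn, l_x)
    K = W_k @ Z + b_k[:, None]         # keys,    shape (d_attn, l_z)
    V = W_v @ Z + b_v[:, None]         # values,  shape (d_out,  l_z)
    S = K.T @ Q                        # scores,  shape (l_z, l_x)
    S = np.where(mask, S, -np.inf)     # disallowed positions get score -inf
    return V @ softmax_cols(S / np.sqrt(d_attn))

rng = np.random.default_rng(2)
d, l = 4, 3
X = rng.standard_normal((d, l))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
b = np.zeros(d)
# Unidirectional self-attention: Mask[t_z, t_x] = [[t_z <= t_x]].
mask = np.triu(np.ones((l, l), dtype=bool))
out = attention(X, X, W_q, b, W_k, b, W_v, b, mask)
assert out.shape == (d, l)
```

With this triangular mask, output column t depends only on tokens 1..t, which is exactly the causal property used for next-token prediction.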

Bidirectional / unmasked self-attention.

Given a sequence of token representations, this variant applies attention to each token, treating all tokens in the sequence as the context. See Algorithm 4, called with token sequence Z = X and no masking (Mask ≡ 1).

Unidirectional / masked self-attention.

Given a sequence of token representations, this variant applies attention to each token, treating all preceding tokens (including itself) as the context. Future tokens are masked out, so this causal auto-regressive version can be used for online prediction. See Algorithm 4, called with token sequence Z = X and Mask[t_z, t_x] = [[t_z ≤ t_x]]. For this Mask, the output Ṽ[:, t] only depends on x[1:t], hence it can be used to predict x[t+1].


Cross-attention.

Given two sequences of token representations (often in the context of a sequence-to-sequence task), this variant applies attention to each token of the primary token sequence X, treating the second token sequence Z as the context. See Algorithm 4, called with Mask ≡ 1. While the output Ṽ and input sequence X have the same length ℓ_x, the context sequence Z can have a different length ℓ_z.

Multi-head attention.

The attention algorithm presented so far (Algorithm 4) describes the operation of a single attention head. In practice, transformers run multiple attention heads (with separate learnable parameters) in parallel and combine their outputs; this is called multi-head attention; see Algorithm 5.

/* Computes multi-head (masked) self- or cross-attention layer. */
Input: X ∈ ℝ^{d_x×ℓ_x}, Z ∈ ℝ^{d_z×ℓ_z}, vector representations of primary and context sequence.
Output: Ṽ ∈ ℝ^{d_out×ℓ_x}, updated representations of tokens in X, folding in information from tokens in Z.
Hyperparameters: H, number of attention heads
Hyperparameters: Mask ∈ {0,1}^{ℓ_z×ℓ_x}
Parameters: W consisting of
   For h ∈ [H], W^h_qkv consisting of:
       W^h_q ∈ ℝ^{d_attn×d_x}, b^h_q ∈ ℝ^{d_attn},
       W^h_k ∈ ℝ^{d_attn×d_z}, b^h_k ∈ ℝ^{d_attn},
       W^h_v ∈ ℝ^{d_mid×d_z}, b^h_v ∈ ℝ^{d_mid}.
   W_o ∈ ℝ^{d_out×H·d_mid}, b_o ∈ ℝ^{d_out}.
1 For h ∈ [H]: Y^h ← Attention(X, Z | W^h_qkv, Mask)
2 Y ← [Y^1; Y^2; …; Y^H]
3 return Ṽ = W_o Y + b_o 1^⊤
Algorithm 5 MHAttention
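A NumPy sketch of Algorithm 5, with biases omitted inside the heads for brevity and all weights as random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
d_x = d_z = d_out = 8; d_attn = d_mid = 4; H = 2; l = 5

def softmax_cols(A):
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def head(X, Z, W_q, W_k, W_v, mask):
    """One attention head (Algorithm 4); biases omitted for brevity."""
    S = (W_k @ Z).T @ (W_q @ X)                  # scores, shape (l_z, l_x)
    S = np.where(mask, S, -np.inf)
    return (W_v @ Z) @ softmax_cols(S / np.sqrt(d_attn))

def mha(X, Z, heads, W_o, b_o, mask):
    """Algorithm 5: run H heads in parallel, stack outputs, project with W_o."""
    Y = np.concatenate([head(X, Z, *w, mask) for w in heads], axis=0)
    return W_o @ Y + b_o[:, None]

heads = [(rng.standard_normal((d_attn, d_x)),
          rng.standard_normal((d_attn, d_z)),
          rng.standard_normal((d_mid, d_z))) for _ in range(H)]
W_o, b_o = rng.standard_normal((d_out, H * d_mid)), np.zeros(d_out)
X = rng.standard_normal((d_x, l))
mask = np.ones((l, l), dtype=bool)               # bidirectional (Mask == 1)
out = mha(X, X, heads, W_o, b_o, mask)
assert out.shape == (d_out, l)
```

Each head sees the full input but has its own projections; the output projection W_o mixes the stacked head outputs back to dimension d_out.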

Layer normalisation.

Layer normalisation explicitly controls the mean and variance of individual neural network activations; the pseudocode is given in Algorithm 6. Some transformers use a simpler and more computationally efficient version of layer normalisation that sets m = β = 0, called root mean square layer normalisation, or RMSnorm.

/* Normalizes layer activations e. */
Input: e ∈ ℝ^{d_e}, neural network activations.
Output: ê ∈ ℝ^{d_e}, normalized activations.
Parameters: γ, β ∈ ℝ^{d_e}, element-wise scale and offset.
1 m ← Σ_i e[i] / d_e
2 v ← Σ_i (e[i] − m)² / d_e
3 return ê = (e − m)/√v ⊙ γ + β, where ⊙ denotes element-wise multiplication.
Algorithm 6 layer_norm
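Both layer normalisation and the RMSnorm variant are a few lines of NumPy. The `eps` term, standard in practice though not shown in the pseudocode, guards against division by zero.

```python
import numpy as np

def layer_norm(e, gamma, beta, eps=1e-5):
    """Algorithm 6: normalize activations to zero mean, unit variance,
    then apply learned element-wise scale and offset."""
    m = e.mean()
    v = e.var()
    return gamma * (e - m) / np.sqrt(v + eps) + beta

def rms_norm(e, gamma, eps=1e-5):
    """RMSnorm variant: skip the mean subtraction and the offset."""
    return gamma * e / np.sqrt((e ** 2).mean() + eps)

e = np.array([1.0, 2.0, 3.0, 4.0])
out = layer_norm(e, np.ones(4), np.zeros(4))
# With unit scale and zero offset, the output has mean ~0 and std ~1.
```

RMSnorm saves the mean computation and, empirically, often matches layer norm's quality at lower cost, which is why models like Gopher adopt it.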


Unembedding.

The unembedding learns to convert a vector representation of a token and its context into a distribution over the vocabulary elements; see Algorithm 7. The algorithm describes an independently learned unembedding matrix, but note that the unembedding matrix is sometimes instead fixed to be the transpose of the embedding matrix.

Input: e ∈ ℝ^{d_e}, a token encoding.
Output: p ∈ Δ(V), a probability distribution over the vocabulary.
Parameters: W_u ∈ ℝ^{N_V×d_e}, the unembedding matrix.
1 return p = softmax(W_u e)
Algorithm 7 Unembedding.
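A NumPy sketch of Algorithm 7; subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the softmax unchanged.

```python
import numpy as np

def unembedding(e, W_u):
    """Algorithm 7: map a token encoding to a distribution over the vocabulary."""
    logits = W_u @ e
    logits = logits - logits.max()     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(4)
N_V, d_e = 10, 6
W_u = rng.standard_normal((N_V, d_e))  # random stand-in for learned weights
p = unembedding(rng.standard_normal(d_e), W_u)
assert p.shape == (N_V,) and abs(p.sum() - 1.0) < 1e-12
```

Tying the unembedding to the embedding would amount to passing `W_e.T` as `W_u`.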

6 Transformer Architectures

This section presents a few prominent transformer architectures, based on attention Algorithms 5 and 4 and using normalization Algorithm 6, in roughly historical order:

  • EDT [VSP17], the original sequence-to-sequence / Encoder-Decoder Transformer: the EDTransformer pseudocode together with Algorithms 8 and 12.

  • BERT [DCLT19], an instance of an encoder-only transformer (encoder-only means that it is derived from the encoder-decoder architecture by dropping the decoder part): the ETransformer pseudocode together with Algorithm 9.

  • GPT [RWC19, BMR20], an instance of a decoder-only transformer: the DTransformer pseudocode together with Algorithms 10 and 11.

While the main architectural difference between BERT and GPT is in attention masking, they also differ in a number of less important ways: e.g. they use different activation functions and the layer-norms are positioned differently. We included these differences in the pseudocode to stay faithful to the original algorithms, but note that different transformer architectures may adopt these selectively.

To simplify notation, we denote by W the entire set of parameters (query, key, value and output linear projections) required by a multi-head attention layer:

W = { W^h_q, b^h_q, W^h_k, b^h_k, W^h_v, b^h_v : h ∈ [H] } ∪ { W_o, b_o }.  (4)
Encoder-decoder / sequence-to-sequence transformer [VSP17].

This is the very first transformer. It was originally used for sequence-to-sequence tasks (machine translation), which is why it is more complicated than its successors.

The idea behind the architecture is as follows: first, the context sequence is encoded using bidirectional multi-head attention. The output of this ‘encoder’ part of the network is a vector representation of each context token, taking into account the entire context sequence. Second, the primary sequence is encoded. Each token in the primary sequence is allowed to use information from the encoded context sequence, as well as primary sequence tokens that precede it. See the EDTransformer pseudocode for more details.


Encoder-only transformer: BERT [DCLT19].

BERT is a bidirectional transformer trained on the task of masked language modelling. Given a piece of text with some tokens masked out, the goal is to correctly recover the masked-out tokens. The original use of BERT was to learn generally useful text representations, which could then be adapted for various downstream NLP tasks. The masking is not performed via the Mask parameter but differently: during training, each input token is replaced with probability p_mask by a dummy token mask_token, and evaluation is based on the reconstruction probability of these knocked-out tokens (see Algorithm 9).

The BERT architecture resembles the encoder part of the seq2seq transformer (hence ‘encoder-only’); it is described in detail in the ETransformer pseudocode. It uses the GELU nonlinearity instead of ReLU:

GELU(x) = x · Φ(x),

where Φ is the cumulative distribution function of the standard normal distribution. (When called with vector or matrix arguments, GELU is applied element-wise.)
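The GELU formula translates directly into NumPy via the error function, since Φ(x) = (1 + erf(x/√2))/2:

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF; element-wise."""
    x = np.asarray(x, dtype=float)
    phi = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    return x * phi

print(gelu([-1.0, 0.0, 1.0]))
```

Unlike ReLU, GELU is smooth and slightly negative for negative inputs, e.g. gelu(−1) ≈ −0.159.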


Decoder-only transformers: GPT-2 [RWC19], GPT-3 [BMR20], Gopher [RBC21].

GPT-2 and GPT-3 are large language models developed by OpenAI, and Gopher is a large language model developed by DeepMind. They all have similar architectures and are trained by autoregressive language modelling: Given an incomplete sentence or paragraph, the goal is to predict the next token.

The main difference from BERT is that GPT/Gopher use unidirectional attention instead of bidirectional attention; they also apply layer-norms in slightly different order.

See the DTransformer pseudocode for GPT-2. GPT-3 is identical except larger, and it replaces dense attention in Line 6 by sparse attention, i.e. each token only uses a subset of the full context.

Gopher also deviates only slightly from the GPT-2 architecture: it replaces layer norm in Lines 5, 7 and 10 by RMSnorm (m = β = 0), and it uses different positional embeddings.


Multi-domain decoder-only transformer: Gato [RZP22].

Gato is a multi-modal multi-task transformer built by DeepMind. It is a single neural network that can play Atari, navigate 3D environments, control a robotic arm, caption images, have conversations, and more.

Under the hood, each modality is converted into a sequence prediction problem by a separate tokenization and embedding method; for example images are divided into non-overlapping patches, ordered in raster order (left-to-right, top-to-bottom) and processed by a ResNet block to obtain a vector representation.

The actual Gato architecture is then a decoder-only transformer like the DTransformer pseudocode, but with Line 2 replaced by modality-specific embedding code.

7 Transformer Training and Inference

This section lists the pseudocode for various algorithms for training and using transformers:

  • EDTraining() Algorithm 8 shows how to train a sequence-to-sequence transformer (the original Transformer [VSP17]).

  • ETraining() Algorithm 9 shows how to train a transformer on the task of masked language modelling (like BERT [DCLT19]).

  • DTraining() Algorithm 10 shows how to train a transformer on the task of next token prediction (like GPT-x [BMR20] and Gopher [RBC21]).

  • DInference() Algorithm 11 shows how to prompt a transformer trained on next token prediction (like GPT-x [BMR20]). The temperature parameter τ interpolates between most likely continuation (τ = 0), faithful sampling (τ = 1), and uniform sampling (τ → ∞).

  • EDInference() Algorithm 12 shows how to use a sequence-to-sequence transformer for prediction.
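The temperature rescaling in DInference can be sketched as follows; `sample_with_temperature` is a hypothetical helper name, and the next-token distribution p is illustrative.

```python
import numpy as np

def sample_with_temperature(p, tau, rng):
    """Rescale a next-token distribution p by temperature tau and sample:
    tau = 0 is the argmax (most likely continuation), tau = 1 is faithful
    sampling, and large tau approaches uniform sampling."""
    if tau == 0:
        return int(np.argmax(p))
    logits = np.log(p) / tau
    logits -= logits.max()             # numerical stability
    q = np.exp(logits)
    q /= q.sum()                       # q proportional to p^(1/tau)
    return int(rng.choice(len(p), p=q))

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])
token = sample_with_temperature(p, 0, rng)   # deterministic argmax
```

In a full decoding loop, p would be the column of the transformer's output distribution at the current position, and the sampled token is appended to the context before the next step.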

Gradient descent.

The described training Algorithms 8 to 10 use Stochastic Gradient Descent (SGD) to minimize the log loss (a.k.a. cross-entropy) as the update rule. Computation of the gradient is done via automatic differentiation tools; see [BPRS18, Table 5]. In practice, vanilla SGD is usually replaced by some more refined variation such as RMSProp or AdaGrad or others [Rud16]. Adam [KB15] is used most often these days.

/* Training a seq2seq model */
Input: {(z_n, x_n)}_{n=1}^{N_data}, a dataset of sequence pairs.
Input: θ, initial transformer parameters.
Output: θ̂, the trained parameters.
Hyperparameters: N_epochs ∈ ℕ, η ∈ (0, ∞)
1 for i = 1, 2, …, N_epochs do
2       for n = 1, 2, …, N_data do
3             ℓ ← length(x_n)
4             P(θ) ← EDTransformer(z_n, x_n | θ)
5             loss(θ) ← −Σ_{t=1}^{ℓ−1} log P(θ)[x_n[t+1], t]
6             θ ← θ − η·∇loss(θ)
7       end for
8 end for
9 return θ̂ = θ
Algorithm 8  EDTraining()
/* Training by masked language modelling */
Input: {x_n}_{n=1}^{N_data}, a dataset of sequences.
Input: θ, initial encoder-only transformer parameters.
Output: θ̂, the trained parameters.
Hyperparameters: N_epochs ∈ ℕ, η ∈ (0, ∞), p_mask ∈ (0, 1)
1 for i = 1, 2, …, N_epochs do
2       for n = 1, 2, …, N_data do
3             ℓ ← length(x_n)
4             for t = 1, 2, …, ℓ do
5                   x̃_n[t] ← mask_token or x_n[t] randomly with probability p_mask or 1 − p_mask
6             end for
7             T̃ ← {t ∈ [ℓ] : x̃_n[t] = mask_token}
8             P(θ) ← ETransformer(x̃_n | θ)
9             loss(θ) ← −Σ_{t∈T̃} log P(θ)[x_n[t], t]
10            θ ← θ − η·∇loss(θ)
11      end for
12 end for
13 return θ̂ = θ
Algorithm 9  ETraining()
/* Training next token prediction */
Input: {x_n}_{n=1}^{N_data}, a dataset of sequences.
Input: θ, initial decoder-only transformer parameters.
Output: θ̂, the trained parameters.
Hyperparameters: N_epochs ∈ ℕ, η ∈ (0, ∞)
1 for i = 1, 2, …, N_epochs do
2       for n = 1, 2, …, N_data do
3             ℓ ← length(x_n)
4             P(θ) ← DTransformer(x_n | θ)
5             loss(θ) ← −Σ_{t=1}^{ℓ−1} log P(θ)[x_n[t+1], t]
6             θ ← θ − η·∇loss(θ)
7       end for
8 end for
9 return θ̂ = θ
Algorithm 10  DTraining()
/* Prompting a trained model and using it for prediction. */
Input: Trained transformer parameters θ̂.
Input: x ∈ V*, a prompt.
Output: y ∈ V*, the transformer’s continuation of the prompt.
Hyperparameters: ℓ_gen ∈ ℕ, τ ∈ (0, ∞)
1 ℓ ← length(x)
2 for i = 1, 2, …, ℓ_gen do
3       P ← DTransformer(x | θ̂)
4       p ← P[:, ℓ + i − 1]
5       sample a token y[i] from q ∝ p^{1/τ}
6       x ← [x, y[i]]
7 end for
8 return y = y[1 : ℓ_gen]
Algorithm 11  DInference()
/* Using a trained seq2seq model for prediction. */
Input: A seq2seq transformer and trained parameters θ̂ of the transformer.
Input: z ∈ V*, input sequence, e.g. a sentence in English.
Output: x̂ ∈ V*, output sequence, e.g. the sentence in German.
Hyperparameters: τ ∈ (0, ∞)
1 x̂ ← [bos_token]
2 while x̂[length(x̂)] ≠ eos_token do
3       P ← EDTransformer(z, x̂ | θ̂)
4       p ← P[:, length(x̂)]
5       sample a token x from q ∝ p^{1/τ}
6       x̂ ← [x̂, x]
7 end while
8 return x̂
Algorithm 12  EDInference()

8 Practical Considerations

While the vanilla transformers provided here may work in practice, a variety of “tricks” have been developed over the years to improve the performance of deep neural networks in general and transformers in particular [LWLQ21]:

  • Data preprocessing: cleaning, augmentation [FGW21], adding noise, shuffling [Lem21] (besides tokenization and chunking).

  • Architecture: sparse layers, weight sharing (besides attention).

  • Training: improved optimizers, minibatches, batch normalization, learning rate scheduling, weight initialization, pre-training, ensembling, multi-task, adversarial (besides layer normalization).

  • Regularization: weight decay, early stopping, cross-validation, dropout, adding noise [MBM20, TZ22].

  • Inference: scratchpad prompting, few-shot prompting, chain of thought, majority voting [LAD22].

  • Others.

Appendix A References

Appendix B List of Notation

Symbol   Explanation
[N] := {1, …, N}   set of integers 1 to N
i, j   generic integer indices
N_V   vocabulary size
V*   set of token sequences; elements include e.g. sentences or documents
ℓ_max   maximum sequence length
ℓ   length of token sequence
t   index of token in a sequence
d_□   dimension of various vectors
x ≡ x[1:ℓ_x]   primary token sequence
z ≡ z[1:ℓ_z]   context token sequence
M[i, j]   entry of matrix M
M[i, :]   i-th row of matrix M
M[:, j]   j-th column of matrix M
e   vector representation / embedding of a token
X   encoded primary token sequence
Z   encoded context token sequence
Mask   masking matrix, it determines the attention context for each token
L   number of network (encoder, decoder) layers
l   index of network layer
H   number of attention heads
h   index of attention head
N_data   (i.i.d.) sample size
n   index of sample sequence
η   learning rate
τ   temperature; it controls the diversity-plausibility trade-off at inference
W_e   token embedding matrix
W_p   positional embedding matrix
W_u   unembedding matrix
W_q   query weight matrix
b_q   query bias
W_k   key weight matrix
b_k   key bias
W_v   value weight matrix
b_v   value bias
W_qkv   collection of above parameters of a single-head attention layer
W_o   output weight matrix
b_o   output bias
W   collection of above parameters of a multi-head attention layer, see eq. (4)
W_mlp   weight matrix corresponding to an MLP layer in a Transformer
b_mlp   bias corresponding to an MLP layer in a Transformer
γ   layer-norm learnable scale parameter
β   layer-norm learnable offset parameter
θ   collection of all learnable / learned Transformer parameters