Transformer Neural Network

What is a Transformer Neural Network?

The transformer is a component used in many neural network designs for processing sequential data, such as natural language text, genome sequences, sound signals or time series data. Most applications of transformer neural networks are in the area of natural language processing.

A transformer neural network takes an input sentence in the form of a sequence of vectors, converts it into a sequence of vectors called an encoding, and then decodes it back into another sequence.

An important part of the transformer is the attention mechanism. The attention mechanism represents how important other tokens in an input are for the encoding of a given token. For example, in a machine translation model, the attention mechanism allows the transformer to translate words like ‘it’ into a word of the correct gender in French or Spanish by attending to all relevant words in the original sentence.

Crucially, the attention mechanism allows the transformer to focus on particular words on both the left and right of the current word in order to decide how to translate it. Transformer neural networks replace the earlier recurrent neural network (RNN), long short term memory (LSTM), and gated recurrent unit (GRU) neural network designs.

Transformer Neural Network Design

The transformer neural network receives an input sentence and converts it into two sequences: a sequence of word vector embeddings, and a sequence of positional encodings.

The word vector embeddings are a numeric representation of the text. It is necessary to convert the words to the embedding representation so that a neural network can process them. In the embedding representation, each word in the dictionary is represented as a vector. The positional encodings are a vector representation of the position of the word in the original sentence.

The transformer adds the word vector embeddings and positional encodings together and passes the result through a series of encoders, followed by a series of decoders. Note that in contrast to RNNs and LSTMs, the entire input is fed into the network simultaneously rather than sequentially.

The encoders each convert their input into another sequence of vectors called encodings. The decoders do the reverse: they convert the encodings back into a sequence of scores for the different output words. The softmax function converts these scores into probabilities, from which the words of the output sentence are chosen.
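The last step, turning scores into words, can be illustrated with a minimal sketch. The four-word vocabulary and the score values are hypothetical, and picking the most probable word (greedy decoding) is only one of several possible selection strategies.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical vocabulary and decoder scores for a single output position.
vocab = ["la", "casa", "roja", "el"]
logits = np.array([3.0, 0.5, 0.2, 1.0])

probs = softmax(logits)
predicted_word = vocab[int(np.argmax(probs))]  # greedy choice: most probable word
print(probs, predicted_word)
```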

Each encoder and decoder contains a component called the attention mechanism, which allows the processing of one input word to include relevant data from certain other words, while masking the words which do not contain relevant information.

Because this must be calculated many times, we implement multiple attention mechanisms in parallel, taking advantage of the parallel computing offered by GPUs. This is called the multi-head attention mechanism. The ability to pass multiple words through a neural network simultaneously is one advantage of transformers over LSTMs and RNNs.

The architecture of a transformer neural network. In the original paper, there were 6 encoders chained to 6 decoders.

Positional Encoding in the Transformer Neural Network

Many other neural network designs, such as LSTMs, use a vector embedding in order to convert words to values that can be fed into a neural network. Every word in the vocabulary is mapped to a constant vector value. For example:

However, a word can have different meanings in different contexts. Compare “I went to the bank” to “I swam to the bank”.

The transformer design adds an extra sinusoidal function to this vector which allows the word vector embedding to vary depending on its position in a sentence. For example,

where w is the index of the word in the sentence.

This allows the neural network to retain some information about the words’ relative positions after the input vectors have been propagated through the layers. Note that the positional encoding alone does not disambiguate the different senses of a word, but rather it serves as a way to transmit information about the order of the sentence to the attention mechanisms.
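As an illustration, the sketch below builds positional encodings and adds them to word embeddings. It assumes the sine and cosine scheme described in Attention Is All You Need, an even embedding size, and randomly generated stand-in embeddings; the function name and the sizes are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) array of position vectors.

    Assumes d_model is even. Even dimensions use sine, odd dimensions use
    cosine, with wavelengths that grow geometrically across dimensions.
    """
    positions = np.arange(num_positions)[:, np.newaxis]   # shape (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]   # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# Hypothetical input: a 5-word sentence with 8-dimensional word embeddings.
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(5, 8))

pos = sinusoidal_positional_encoding(5, 8)
encoder_input = word_embeddings + pos  # the sum is what the first encoder receives
print(encoder_input.shape)
```

Because the same word receives a different vector at each position, the later attention layers can make use of word order even though all words are fed in at once.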

Attention Mechanism in the Transformer Neural Network

The most important part of a transformer neural network is the attention mechanism. The attention mechanism addresses the question of which parts of the input vector the network should focus on when generating the output vector.

This is very important in translation. For example, the English “the red house” corresponds to “la casa roja” in Spanish: the two languages have different word orders.

The attention mechanisms allow a decoder, while it is generating an output word, to focus more on relevant words or hidden states within the network, and focus less on irrelevant information.

As a simplified example, when translating “the red house” to Spanish, the attention vector for the first output word could be as follows:

In practice attention is used in three different ways in a transformer neural network:

(1) Encoder-decoder attention, as in the above example. An attention mechanism allowing a decoder to attend over the input sequence when generating the output sequence.

(2) Self-attention in the encoder. This allows an encoder to attend to all parts of the encoding output from the previous encoder.

(3) Self-attention in the decoder. This allows a decoder to attend to all parts of the sequence inside the decoder.

The attention mechanisms allow a model to draw information from input words and hidden states at any other point in the sentence.

Taking this further, we can generate a matrix showing the strength of the attention vector between each word in the source language and target language:

Above: the alignment matrix of a translation from English to Spanish.

Attention Formula in the Transformer Neural Network

The attention mechanism function is like a fuzzy dictionary lookup: it takes a query and a set of key-value pairs, and outputs a weighted sum of the values that correspond to the keys that are most similar to the query. The attention function allows the transformer neural network to focus on a subset of its input vectors.

The most common formula for attention in a transformer neural network is the scaled dot-product attention:

The mathematical definition of the scaled dot-product attention function
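For reference, scaled dot-product attention as defined in Attention Is All You Need can be written as follows, using the symbols Q, K, V and d_k referred to in this section:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
```

The product of Q with the transpose of K measures how similar each query is to each key, the division by the square root of d_k keeps these scores in a moderate range, and the softmax turns them into weights that sum to one.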

Attention Formula Symbols Explained

Q	A vector of queries of dimension d_k
K	A vector of keys of dimension d_k
V	A vector of values of dimension d_k
d_k	The size of the attention keys. This is a hyperparameter chosen at design time.

Note that Q, K and V can come from different sources depending on where the attention mechanism is used in the transformer (self-attention or encoder-decoder attention).

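The attention computation is parallelized so that we can calculate the attention for multiple positions in the sentence simultaneously. This is done by stacking the query, key and value vectors for all positions into matrices, so that the above formula is evaluated as a single matrix operation. In the multi-head attention mechanism, several such attention functions, each with its own learned projections of the queries, keys and values, run in parallel and their outputs are concatenated.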
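A minimal sketch of multi-head self-attention is shown below. The sizes, the random projection matrices and the function names are hypothetical stand-ins for learned parameters, and the heads are computed in a simple loop for readability, whereas a real implementation would batch them on a GPU.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for all query positions at once."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X_q, X_kv, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        head_outputs.append(attention(X_q @ W_q, X_kv @ W_k, X_kv @ W_v))
    concatenated = np.concatenate(head_outputs, axis=-1)  # join heads feature-wise
    return concatenated @ W_o                              # final output projection

# Hypothetical sizes: 4 positions, model width 8, 2 heads of width 4.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))
X = rng.normal(size=(seq_len, d_model))

# Self-attention: queries, keys and values are all projections of X.
self_attended = multi_head_attention(X, X, heads, W_o)
print(self_attended.shape)
```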

Calculating Attention in the Transformer Neural Network

Let us consider the case of attention key size 3 and the below values for the keys, the values and the query. Note that the query is identical to the second key, so we expect the attention function to return the second row of V.

First, we do the matrix multiplication

Now we calculate the scaled attention logits

Putting through the softmax function, we obtain

Note that we have retrieved the second value from the value matrix. The attention function performed a lookup, found that the query matched the second key, and returned the second value. In practice, the query will normally match a weighted combination of keys and the attention function returns a weighted average of the corresponding values.
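The same kind of lookup can be reproduced numerically. The sketch below uses hypothetical values for the keys, values and query (not the ones from the worked example above), chosen so that the query equals the second key and the keys are far enough apart for the softmax to be nearly one-hot.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                  # attention key size (3 here)
    logits = Q @ K.T / np.sqrt(d_k)    # scaled attention logits
    weights = softmax(logits)          # one weight per key, summing to 1
    return weights @ V, weights        # weighted average of the values

# Hypothetical keys, values and query; the query equals the second key.
K = np.array([[10.0, 0.0, 0.0],
              [0.0, 10.0, 0.0],
              [0.0, 0.0, 10.0]])
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
Q = np.array([[0.0, 10.0, 0.0]])

output, weights = scaled_dot_product_attention(Q, K, V)
print(np.round(weights, 3))
print(np.round(output, 3))
```

With these values almost all of the attention weight falls on the second key, so the output is essentially the second row of V. Shrinking the keys (for example, replacing 10.0 with 1.0) spreads the weights out and gives a blend of all three values instead.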

Inside the transformer neural network, the attention mechanism can appear as self-attention, where Q, K and V are all derived from the same sequence, or as encoder-decoder attention, where Q is taken from the previous decoder layer and K and V come from the output of the encoder.

Transformer Neural Network vs RNN

RNNs have a fundamentally different design from transformers. An RNN processes the input words one by one, and maintains a hidden state vector over time. Every input word is passed through several layers of the neural network and modifies the state vector. In theory, at a given time the state vector could retain information about inputs from far in the past. In practice, however, the hidden state usually conserves little usable information about early inputs. New inputs can easily overwrite a state, causing information loss. This means that the performance of an RNN tends to degrade over long sentences. This is called the long-term dependency problem.
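The sequential nature of an RNN can be seen in a minimal sketch of a simple recurrent layer with a tanh activation. The weight names, sizes and random values are hypothetical, and a practical RNN would also have an output layer and trained weights.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence and return the final hidden state."""
    hidden = np.zeros(W_h.shape[0])
    for x_t in inputs:  # one word at a time: step t needs the state from step t-1
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
    return hidden

# Hypothetical sizes: 6 input words, 3-dimensional inputs, 4-dimensional state.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(6, 3))
W_x = rng.normal(size=(4, 3))
W_h = rng.normal(size=(4, 4))
b = np.zeros(4)

final_state = rnn_forward(inputs, W_x, W_h, b)
print(final_state)
```

The whole sentence has to be squeezed through the single vector `hidden`, and every new word overwrites part of it, which is where the information loss described above comes from.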

This contrasts with the transformer design, where the entire input sequence is processed at the same time and the attention mechanism allows each output word to draw from each input and hidden state.

Because RNNs process the input sequence sequentially, it is hard to take advantage of high-performance computing such as GPUs. The transformer design, with its parallel processing and multi-head attention mechanisms, allows for much faster training and execution, since the different input words can be processed simultaneously on a GPU.

Transformer Neural Network vs LSTM

LSTMs are a special kind of RNN which has been very successful for a variety of problems such as speech recognition, translation, image captioning, text classification and more. They were explicitly designed to deal with the long-term dependency problem faced by standard RNNs, but use a very different approach from the transformer design.

The core idea of the LSTM design is the cell state. This is a hidden state that is maintained over time as the LSTM receives the input tokens. Since an LSTM is a kind of recurrent neural network, it receives the inputs one by one. However, in addition to the standard RNN design, the LSTM carefully regulates the ability to alter the information in the hidden cell state by means of structures called ‘gates’. In the standard LSTM design there are three gates, called the ‘input gate’, the ‘output gate’ and the ‘forget gate’.

In the example of translating a sentence from English to Spanish, when the LSTM encounters a new subject noun, the forget gate might erase the gender of the previous subject and the input gate might store the gender of the new subject.
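The role of the three gates can be sketched as a single LSTM time step. This follows the commonly used formulation of the LSTM cell; packing all four weight blocks into one matrix, the names and the random values are choices made for this illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, input + hidden), b has shape (4 * hidden,)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    forget_gate = sigmoid(z[0:n])          # how much of the old cell state to keep
    input_gate = sigmoid(z[n:2 * n])       # how much new information to write
    output_gate = sigmoid(z[2 * n:3 * n])  # how much of the cell state to expose
    candidate = np.tanh(z[3 * n:])         # proposed new content for the cell state
    c_t = forget_gate * c_prev + input_gate * candidate
    h_t = output_gate * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes: 5 input words, 3-dimensional inputs, 4-dimensional state.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # inputs are still consumed one by one
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```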

The elaborate gated design of the LSTM partly solves the long-term dependency problem. However, the LSTM, being a recurrent design, must still be trained and executed sequentially. This means that dependencies can only flow from left to right, rather than in both directions as in the case of the transformer's attention mechanism. Furthermore, its recurrent design still makes it hard to use parallel computing, and this means that LSTMs are very slow to train.

Before the development of the transformer architecture, many researchers added attention mechanisms to LSTMs, which improved performance over the basic LSTM design. The transformer neural network was born from the discovery that the recurrent design, with sequential word input, was no longer necessary, and the attention mechanism alone could deliver improved accuracy. This paved the way for the parallel design of the transformer which enables training on high performance devices such as GPUs.

Applications of Transformer Neural Networks

Transformer neural networks are useful for many sequence-related deep learning tasks, such as machine translation (as described above), information retrieval, text classification, document summarization, image captioning, and genome analysis.

Transformer Neural Networks in Information Retrieval

Since 2019, Google Search has used Google’s transformer neural network BERT for search queries in over 70 languages.

Prior to this change, a lot of information retrieval was keyword based, meaning Google checked its crawled sites without strong contextual clues. Take the example word ‘bank’, which can have many meanings depending on the context.

The introduction of transformer neural networks to Google Search means that queries where words such as ‘from’ or ‘to’ affect the meaning are better understood by Google. Users can search in more natural English rather than adapting their search query to what they think Google will understand.

An example from Google’s blog is the query “2019 brazil traveler to usa need a visa.” The position of the word ‘to’ is very important for the correct interpretation of the query. The previous implementation of Google Search was not able to pick up this nuance and returned results about USA citizens traveling to Brazil, whereas the transformer model returns much more relevant pages.

A further advantage of the transformer architecture is that learning in one language can be transferred to other languages via transfer learning. Google was able to take the trained English model and adapt it easily for the other languages’ Google Search.

Transformer Neural Networks for Text Generation

OpenAI have demonstrated how their transformer models GPT-2 and GPT-3 can generate extremely humanlike texts.

In their paper Fine-Tuning Language Models from Human Preferences, OpenAI introduced reinforcement learning instead of supervised learning to train a transformer neural network to generate text. In this set-up, the transformer neural network receives a ‘reward’ if it generates a continuation of the story which is judged pleasing to human readers.

One concern voiced by many is the possibility of this high-quality text generation being used for malicious purposes, such as generating fake news or offensive content. OpenAI are exploring the possibilities of using reinforcement learning as a kind of safety measure, allowing humans to ensure that a text generation model does not start producing offensive output, for example. This has been a real concern since an incident in 2016, when Microsoft’s machine learning chatbot Tay was hijacked by malicious actors and began to output offensive texts.

History of Transformer Neural Networks

Transformer neural networks were first proposed by a Google-led team in 2017 in a widely cited paper titled Attention Is All You Need, which built the entire network around the attention mechanism. Before the invention of the transformer, sequence-related tasks were mainly handled with variations on recurrent neural networks (RNNs).

RNNs were invented by David Rumelhart in 1986 but have severe limitations for practical use in their original form, because when they are being trained on long sequences, gradients tend to explode out of control or vanish to nothing. These are known as the exploding gradient problem and the vanishing gradient problem respectively. They were partly solved by the introduction of the long short term memory neural network (LSTM) and the gated recurrent unit (GRU), which were modifications of the original RNN design. Both LSTM and GRU use components similar to logic gates to remember information from the beginning of a sequence and avoid vanishing and exploding gradients.

From 2007 onwards, LSTMs, and later GRUs (introduced in 2014), began to revolutionize speech recognition and machine translation. However, their main limitations were that they were slow to train, due to lack of parallelization, and that they did not use all surrounding context when encoding a word. Both of these concerns were addressed by the authors of Attention Is All You Need with the introduction of the multi-headed attention mechanism. The breakthrough in their paper was the insight that if the network is based on the attention mechanism, then it is no longer necessary to have a recurrent architecture, paving the way for more stable models that are easier to train.

In 2018, Google open-sourced BERT, a language model in TensorFlow based on transformers, and in 2019 OpenAI partly released GPT-2, based on a slightly different transformer architecture. GPT-2 made headlines because OpenAI stated that they would not initially release the full trained model for fear of it being used for malicious purposes such as fake news.