SANVis: Visual Analytics for Understanding Self-Attention Networks

by   Cheonbok Park, et al.

Attention networks, a deep neural network architecture inspired by humans' attention mechanism, have seen significant success in image captioning, machine translation, and many other applications. Recently, they have been further evolved into an advanced approach called multi-head self-attention networks, which can encode a set of input vectors, e.g., word vectors in a sentence, into another set of vectors. Such encoding aims at simultaneously capturing diverse syntactic and semantic features within a set, each of which corresponds to a particular attention head, forming altogether multi-head attention. Meanwhile, the increased model complexity prevents users from easily understanding and manipulating the inner workings of models. To tackle the challenges, we present a visual analytics system called SANVis, which helps users understand the behaviors and the characteristics of multi-head self-attention networks. Using a state-of-the-art self-attention model called Transformer, we demonstrate usage scenarios of SANVis in machine translation tasks. Our system is available at



1 Related Work

We discuss related work from two perspectives: (1) visual analytic approaches for interpreting and interacting with various deep neural networks and (2) interpretation and analysis of self-attention networks mainly in NLP domains.

Regarding the former, various visual analytic approaches have been proposed for convolutional neural networks, mainly in computer vision domains [liu16cnnvis, zeiler2014visualizing, Pezzotti17DeepEyes, Liu18DeepTracker, alsall2017hierachy, revacnnfilm], and for RNNs in NLP domains [ming2017rnnvis, strobelt2018lstmvis, strobelt_seq2seq-vis_2019, cashman_rnnbow_2018, kwon_retainvis_2019]. Visual analytic approaches have also been integrated with other advanced neural network architectures, such as generative adversarial networks [kahng18GanLab, wang2018GANViz] and deep reinforcement learning [wang2019dqnvis]. Among them, Strobelt et al. [hendrik2018seq] developed a visual analytic system for RNN-based attention models, mainly for the exploration and understanding of sequence-to-sequence modeling tasks. However, despite the success of multi-head self-attention networks, such as BERT [jacob2018bert] and XLNet [Yang2019XLNetGA], no visual analytic approach has so far targeted these advanced attention networks.

In NLP domains, recent studies [jacob2018bert, Vig2019AnalyzingTS, Clark2019WhatDB] have analyzed diverse behaviors of different attention heads in a self-attention model and have drawn linguistic interpretations as to what kind of syntactic and/or semantic features each attention head captures. Another line of research [Voita2019AnalyzingMS, Strubell2018LinguisticallyInformedSF] has attempted to leverage insights obtained from such in-depth analysis to improve prediction accuracy and computational efficiency by, for example, removing unnecessary heads and refining the remaining ones. However, these approaches have not properly utilized the potential of interactive visualization, so in this respect, our work is one of the first sophisticated visual analytics systems for self-attention networks.

Figure 2: Diverse attention patterns found in the encoder of Transformer. Some attention heads show diagonal patterns indicating that a query word attends to itself (1) or its immediate previous (5) or next word (3). Some other attention heads attend to a common, single word (2). In other attention heads, each group of consecutive words attends commonly to a single word within that group (4).

2 Preliminaries: Self-Attention Networks

This section briefly reviews the self-attention module originally proposed in Transformer [vasw2017transformer]. Transformer adopts an encoder-decoder architecture to solve sequence-to-sequence learning tasks, which convert a sequence of words in one domain into a sequence in another domain. In neural machine translation, for example, it translates a sentence in one language into a sentence in another language. In this process, the encoder of Transformer converts input words (e.g., English words) to internal, hidden-state vectors, and the decoder turns the vectors into a sequence of output words (e.g., French words).

The encoder and the decoder each consist of multiple layers. Each layer in the encoder includes two sequential sub-layers: a multi-head self-attention and a position-wise feed-forward network. In addition to these sub-layers, each decoder layer has an additional attention sub-layer, called encoder-decoder attention, which helps the model attend to the encoder’s internal states. Each layer of both the encoder and the decoder also applies skip connections and layer normalization in its computation pipeline. Overall, the encoder and the decoder are stacks of identical encoder layers or decoder layers, respectively, on top of an embedding layer.

We summarize the computation process with mathematical notation; readers interested in the details can read the remainder of this section. Let us denote $d$ as the size of the hidden state vector and $H$ as the number of heads in multi-head self-attention. The dimension of each query, key, and value vector is $d/H$.

The embedding layer transforms each input token into its embedding space using a word embedding and adds the position information for each input token using sinusoidal functions (see Steps 1 and 2 in Figure 1). Here, $x_i$ denotes the $i$-th input token in the input sequence.
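As an illustration of the position information added in the embedding layer, the following is a minimal sketch of the sinusoidal encoding defined in the original Transformer paper, written with the Python standard library only. The function names and the plain nested-list representation are assumptions made for this example, and details such as embedding scaling and dropout are omitted.

```python
import math

def sinusoidal_position_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position codes."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def embed(token_vectors):
    """Add the position code to each token embedding, element-wise."""
    d_model = len(token_vectors[0])
    codes = sinusoidal_position_encoding(len(token_vectors), d_model)
    return [[e + p for e, p in zip(vec, code)]
            for vec, code in zip(token_vectors, codes)]
```

Because the encoding depends only on the position index and the dimension, the same table can be reused for every sentence of a given length.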

At each attention head, we transform the encoded word vectors $X$ into three matrices, a query $Q_h$, a key $K_h$, and a value $V_h$, using linear transformations. Repeating this $H$ times, once per head, generates $3H$ matrices. We then compute the attention-weighted combinations of value vectors as

$$\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{d/H}}\right) V_h, \qquad Q_h = X W_h^{Q},\quad K_h = X W_h^{K},\quad V_h = X W_h^{V},$$

where $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$ indicate the linear transformation matrices at the $h$-th head, and $d/H$ is the dimension of each query, key, and value vector. In multi-head self-attention, which consists of $H$ parallel attention heads, the transformation matrices of each head are randomly initialized, and each set is then used to project the input vectors onto a different representation subspace. For this reason, every attention head is allowed to have different attention shapes and patterns. This characteristic encourages the heads to attend differently, for example to adjacent words or to linguistically related words.
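The computation above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in SANVis: matrices are nested lists, the helper names (`matmul`, `attention_head`, `multi_head`) are hypothetical, and the final output projection that Transformer applies after concatenating the heads is omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """One head: returns (attention matrix, attention-weighted values)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])  # per-head dimension, d/H in the text
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Each row is one query word's distribution over all key words.
    attn = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return attn, matmul(attn, v)

def multi_head(x, heads):
    """Run every head and concatenate the per-head outputs per word."""
    outputs = [attention_head(x, *weights)[1] for weights in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]
```

The attention matrix returned by `attention_head` is the object that the visualizations in the following sections operate on: one row per query word, one column per key word.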

In the encoder layer, source words (input words to the encoder) work as the input to the query, key, and value transformations at the $h$-th head. In the decoder layer, the input varies by attention type. While the decoder’s self-attention takes target words (output words of the decoder) as its input, the encoder-decoder attention takes target words as the input to the query transformation but source words as the input to the key and value transformations.

Figure 3: Attention sorting result. The user can sort a set of multiple attention patterns with respect to different criteria such as the entropy measure (A) and the relative positional offset from query words (B).

3 Design Rationale

We consider our design rationale of SANVis as follows:

T1: Grasping the information flow across multiple layers.

T2: Identifying and making sense of attention patterns of each attention head.

T3: Visualizing syntactic and semantic information so that the user can explore what each attention head encodes in its query and key vectors.

4 SANVis

We present SANVis, a visual analytics system for deep understanding of self-attention models, as shown in Figure 1. SANVis is composed of the network view and the HeadLens view. (1) The network view allows the user to understand the overall information flow across the multiple layers (T1). Moreover, this view provides additional visualization options that assist the user in distinguishing distinct patterns among the multiple attention patterns within a layer (T2). (2) The HeadLens view reveals the characteristics of the query and the key vectors of a particular head and the relationship between them (T3).

4.1 Network View

The network view mainly visualizes an overview of attention patterns across multiple layers using a Sankey diagram (T1). Additionally, this view supports ‘piling’ and ‘sorting’ capabilities to understand common as well as distinct attention patterns among multiple attention heads (T2). For example, one can replace the Sankey diagram with a multiple heatmap view, where multiple heatmaps corresponding to different heads can be sorted by several different criteria (Figure 1(A-3)). The attention piling view aggregates multiple attention patterns into a small number of clusters (Figure 1(A-1)).

As shown in Figure 1(A), the words of a sentence are aligned vertically in each layer, and each word is accompanied by a histogram of the attention weights it receives from multiple heads. In Figure 1(A-4), each bar corresponds to a particular head within the layer, and its height represents the total amount of attention weight assigned to that word by that head. For instance, in Figure 1(A-2), if the fourth head in the layer attended to the word ‘planet’ more strongly than the other heads did, the fourth bar would be higher than the others. In this manner, the user can easily recognize which heads attend strongly to a word based on the heights of the histogram bars. Furthermore, when the user moves the mouse over a particular color bar, we show the attention heatmap of the corresponding head in that layer.

Sankey diagram. As shown in Figure 1(A-2), the weight of an edge between two words in adjacent layers represents the average attention weight across multiple heads within a particular layer. In this figure, we can see strong links from ‘planet’ in layer 2 to the preposition (‘on’), the same word (‘planet’), and the article (‘the’) in layer 3. This means that a significant amount of the information of ‘planet’ in layer 2 is conveyed to encode each word of the phrase “on the planet” in layer 3. This pattern suggests that the model captures the contextual meaning of the word, defined by the linguistic phrase it belongs to in the given sentence, to improve the quality of translation.
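The two quantities shown in the network view can be sketched as follows: the per-head histogram bar of a word and the Sankey edge weight between two words. This is a minimal sketch under the assumption that each head's attention matrix is stored with one row per query word and one column per key word; the function names are hypothetical.

```python
def head_bar_heights(attn_heads):
    """bars[h][j]: total attention that head h assigns to key word j."""
    return [[sum(row[j] for row in attn) for j in range(len(attn[0]))]
            for attn in attn_heads]

def sankey_edge_weights(attn_heads):
    """edges[i][j]: attention from query word i to key word j,
    averaged over all heads of the layer."""
    n_heads = len(attn_heads)
    n_query, n_key = len(attn_heads[0]), len(attn_heads[0][0])
    return [[sum(a[i][j] for a in attn_heads) / n_heads
             for j in range(n_key)]
            for i in range(n_query)]
```

For a layer with heads `heads`, `head_bar_heights(heads)[3][j]` would be the height of the fourth bar next to word `j`, and `sankey_edge_weights(heads)[i][j]` the thickness of the link between words `i` and `j`.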

Attention Sorting. Figure 2 shows various attention patterns between query (y-axis) and key (x-axis) words for different attention heads in different layers. A gray diagonal line marks the positions where a word attends to itself and serves as a visual reference for reading the attention patterns.

We focus on reducing the user’s effort in finding distinct attention patterns by providing sorting algorithms based on the relative positional information and the entropy value of the attention (Figure 2). Relative positional information, such as whether the attention goes mainly toward the left, the right, or the current location, as well as the column-wise mean entropy value of the attention matrix, allows the user to detect these patterns easily.

Figure 3 shows the sorted results of attention patterns based on our position and entropy sorting algorithms. When sorted by position, attention patterns that incline towards past words are placed near the control panel at the top, while those that lean towards future words are placed closer to the bottom. When sorted by entropy, the uppermost attention pattern has the lowest entropy and exhibits bar-shaped attention, in which numerous query words attend to the same word. Towards the bottom, the user finds patterns in which query words no longer focus on a common word.
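The two sorting criteria can be sketched as below. The exact scores used in SANVis are not spelled out in the text, so this sketch makes two assumptions: the entropy score is the mean Shannon entropy of each query word's attention distribution (rows of the matrix), and the position score is the attention-weighted mean offset between key and query positions. All function names are hypothetical.

```python
import math

def mean_entropy(attn):
    """Average Shannon entropy (in nats) of the per-query distributions."""
    total = 0.0
    for row in attn:
        total -= sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn)

def mean_offset(attn):
    """Attention-weighted mean of (key position - query position).
    Negative values lean towards past words, positive towards future words."""
    total = 0.0
    for i, row in enumerate(attn):
        total += sum(p * (j - i) for j, p in enumerate(row))
    return total / len(attn)

def sort_heads(attn_heads, criterion):
    """Return head indices ordered by the chosen score, smallest first."""
    score = {"entropy": mean_entropy, "position": mean_offset}[criterion]
    return sorted(range(len(attn_heads)), key=lambda h: score(attn_heads[h]))
```

With this ordering, heads attending to past words come first under the position criterion and low-entropy, bar-shaped heads come first under the entropy criterion, which matches the top-to-bottom layout described above.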

Figure 4: Attention piling example in the encoder layer and encoder-decoder layer. In the encoder-decoder example, piling results do not have a gray diagonal line because of the difference between the count of query words and key words.

Attention Piling. Inspired by heatmap piling methods [bach2015smallmultiples, Henry2007nodetrix], we apply the piling idea to summarize multiple attention patterns in a single layer, as shown in the encoder part of Figure 4. To this end, we compute a feature vector for each attention head and perform clustering to form piles (or clusters) of multiple attention patterns.

For a sentence of $n$ words, the feature vector of the $h$-th attention head is defined as the flattened $n^2$-dimensional vector of its attention matrix $A_h$, where $A_h$ is calculated as $\mathrm{softmax}(Q_h K_h^{\top}/\sqrt{d/H})$ on the $h$-th head, concatenated with an additional three-dimensional vector of (1) the sum of the upper-triangular part of the matrix, (2) that of the lower-triangular part, and (3) the sum of the diagonal entries. This three-dimensional vector indicates how much attention is assigned to (1) the previous words of a query word, (2) its next words, and (3) the query word itself, respectively.

Using these feature vectors of multiple attention heads within a single layer, we perform hierarchical clustering based on their Euclidean distances. In this manner, multiple attention patterns are grouped, forming an aggregated heatmap visualization per computed pile along with the head indices belonging to each pile, as shown in Figure 4. By adjusting the Euclidean distance threshold, the user can easily find similar and distinct patterns within the same layer.
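A minimal sketch of the piling procedure follows. Several details are assumptions made for illustration, since the text does not specify them: average linkage is used for the hierarchical clustering, the aggregated heatmap of a pile is taken to be the element-wise mean of its members, and the function names are hypothetical. Which triangle corresponds to previous or next words depends on whether rows index queries or keys, so the code only names the sums by their position in the matrix.

```python
import math

def pile_features(attn):
    """Flattened attention matrix plus upper, lower, and diagonal sums."""
    flat = [p for row in attn for p in row]
    upper = sum(p for i, row in enumerate(attn)
                for j, p in enumerate(row) if j > i)
    lower = sum(p for i, row in enumerate(attn)
                for j, p in enumerate(row) if j < i)
    diag = sum(row[i] for i, row in enumerate(attn) if i < len(row))
    return flat + [upper, lower, diag]

def pile_heads(attn_heads, max_distance):
    """Agglomerative clustering (average linkage, an assumption here),
    stopped once the closest two piles are farther than max_distance."""
    feats = [pile_features(a) for a in attn_heads]
    piles = [[h] for h in range(len(feats))]

    def linkage(p, q):
        # Average pairwise Euclidean distance between two piles.
        return (sum(math.dist(feats[a], feats[b]) for a in p for b in q)
                / (len(p) * len(q)))

    while len(piles) > 1:
        d, i, j = min((linkage(piles[i], piles[j]), i, j)
                      for i in range(len(piles))
                      for j in range(i + 1, len(piles)))
        if d > max_distance:
            break
        piles[i] = sorted(piles[i] + piles[j])
        del piles[j]
    return piles

def pile_heatmap(attn_heads, pile):
    """Element-wise mean of the attention matrices in one pile."""
    first = attn_heads[pile[0]]
    return [[sum(attn_heads[h][i][j] for h in pile) / len(pile)
             for j in range(len(first[0]))]
            for i in range(len(first))]
```

Raising `max_distance` merges more heads into fewer piles, while lowering it keeps distinct patterns apart, which corresponds to the adjustable distance described above.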

4.2 HeadLens

Figure 5: HeadLens showing the encoder-decoder attention of head 7 in layer 4.

To analyze a particular attention head, SANVis offers a novel view called the HeadLens, as shown in Figure 1(B). This view facilitates detailed analysis of the query and key representations of the selected attention head, such as which linguistic or positional features they encode (T3). This view opens when a user clicks a particular heatmap corresponding to an attention head in the network view.

The HeadLens is generated as follows. (1) It performs clustering on query and key vectors separately. (2) For each pair of a query and a key cluster centroid, the pairwise similarity is computed, forming the heatmap visualization in Figure 1(B-2). (3) Additionally, the POS tagging and the positional information are summarized for each of the query and the key clusters (Figure 1(B-3)). (4) Once a user clicks a particular cell in the heatmap, the corresponding query and key clusters are summarized in terms of their representative keywords (Figure 1(B-4)).

To be specific, in the first step, we consider all the sentences in a validation set and obtain the query and the key vectors of all the words from these sentences. These query and key vectors are the results of applying the query and the key transformations of a given attention head to the input words. Next, we run the $k$-means++ [K_plus_mean] algorithm on each of the above-described query and key vector sets, using a pre-defined number of clusters, e.g., 16 in our case. We empirically set this number of clusters by using an elbow method.
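The clustering step can be sketched as follows, using k-means++ seeding followed by standard Lloyd iterations. This is a self-contained illustration with hypothetical names and a fixed iteration count, not the implementation used in SANVis; in practice one would call it once on the query vectors and once on the key vectors of the selected head.

```python
import math
import random

def kmeans_pp(vectors, k, iters=50, seed=0):
    """Cluster vectors with k-means++ seeding and Lloyd iterations.
    Assumes at least k distinct vectors are provided."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(vectors))]
    while len(centroids) < k:
        # Pick the next seed with probability proportional to the squared
        # distance from each vector to its nearest existing seed.
        d2 = [min(math.dist(v, c) ** 2 for c in centroids) for v in vectors]
        centroids.append(list(rng.choices(vectors, weights=d2, k=1)[0]))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels
```

The returned centroids are the inputs to the similarity heatmap, and the labels determine which words belong to each query or key cluster.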

In the second step, we obtain the cluster centroid vectors from the set of clusters for query vectors as well as the centroid vectors for key vectors. We then compute the inner product similarity between each pair of a query cluster centroid and a key cluster centroid, and visualize the results as a heatmap (Figure 1(B-2)). We choose the inner product as the similarity measure since the attention weight is mainly computed based on the inner product between a query and a key vector. Within each heatmap, a cell is colored red (or blue) if it has high (or low) similarity. A high inner product between a query cluster and a key cluster means that the words in the query cluster are likely to attend to the words in the key cluster.
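The cluster-level heatmap is simply the matrix of inner products between centroids. A minimal sketch, with hypothetical function names and centroids given as plain lists:

```python
def centroid_similarity(query_centroids, key_centroids):
    """sim[i][j]: inner product of query centroid i and key centroid j."""
    return [[sum(q * k for q, k in zip(qc, kc)) for kc in key_centroids]
            for qc in query_centroids]

def strongest_pair(sim):
    """Indices (query cluster, key cluster) of the largest similarity."""
    return max(((i, j) for i, row in enumerate(sim)
                for j in range(len(row))),
               key=lambda ij: sim[ij[0]][ij[1]])
```

With 16 query clusters and 16 key clusters, `centroid_similarity` yields a 16 x 16 matrix, and `strongest_pair` points at the kind of cell a user would typically click first.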

In the third step, the HeadLens provides a summary of each of the query and the key clusters. Each query (or key) cluster contains those words whose query (or key) vectors belong to the cluster. For those words, we obtain their part-of-speech (POS) tags and their position indices within the sentences in which they appear. For POS tags, we used a universal POS tagger [stafordCorenlp]. Afterward, the proportion of words with each POS tag type among all the words within a single cluster is shown as the width of a horizontal bar in the color encoding that tag, as shown in the left bar of Figure 1(B-3). In addition, the proportion of words appearing at a particular position of their original sentences is color-coded (a higher value is colored red), as shown in the right bar of Figure 1(B-3).
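The per-cluster summaries can be computed with simple counting. In the sketch below, the POS tags and position indices of the words in a cluster are assumed to be available as lists; the function names and the example tag strings are hypothetical.

```python
from collections import Counter

def pos_proportions(cluster_tags):
    """Fraction of words per POS tag within one cluster,
    ordered from the most to the least frequent tag."""
    counts = Counter(cluster_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}

def position_histogram(cluster_positions, max_len):
    """Fraction of the cluster's words found at each sentence position."""
    counts = Counter(cluster_positions)
    total = len(cluster_positions)
    return [counts.get(p, 0) / total for p in range(max_len)]
```

For example, `pos_proportions(["AUX", "AUX", "NOUN", "AUX"])` gives the bar widths for a cluster dominated by auxiliary verbs, and `position_histogram` gives the values that would be color-coded in the positional bar.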

Finally, in the fourth step, the user can click a particular entry in the cluster-level heatmap, e.g., a pair of a query and a key cluster with high similarity (a red cell highlighted by a black square in Figure 1(B-2)). Then, the summaries of the corresponding query and key clusters are indicated by a black-colored edge (Figure 1(B-3)). Additionally, a word cloud visualization of the user-selected query and key clusters highlights the frequently appearing words in each cluster, color-coded with their POS tag types (Figure 1(B-4)).

For example, the selected entry in Figure 1(B-2) indicates that the query cluster 15 has high similarity with the key cluster 15. The selected query cluster mainly contains auxiliary verbs (orange-colored), while the selected key cluster mainly contains nouns (purple-colored) in Figure 1(B-3). Furthermore, their most frequent words are shown in the word cloud view (Figure 1(B-4)), which indicates that this head assigns a high attention weight to these nouns, e.g., ‘world’ and ‘life,’ when the query word is an auxiliary verb, e.g., ‘is’ and ‘are.’ This result shows that the selected head has captured the linguistic relationship between nouns and auxiliary verbs.

5 Usage Scenarios

This section demonstrates usage scenarios of SANVis, mainly focusing on the recently proposed Transformer. This model has shown superior performance in machine translation tasks, including English-French and English-German translation tasks in the WMT challenge [noauthor_acl_nodate]. Our implementation of the Transformer is based on the annotated Transformer [opennmt]. Our model parameter settings follow the base model in the original paper [vasw2017transformer]. We set our target task as English-French translation, where a collection of transcripts from TED lectures is used as our dataset [paul2018personalnmt]. The BLEU score of our model is 38.4, which indicates a reasonable level of performance. For evaluating our system, we used the validation set, which is not seen during training.

Attention Piling. In Figure 4, the encoder-decoder part shows the attention piling visualization for encoder-decoder attention. In this example, one can observe that a number of attention heads have a diagonal attention pattern. A plausible explanation of this diagonal shape is that the words in French and English are generally aligned in the same order [bahdnamu2014seq2attn]. For debugging purposes, this suggests that the model has properly learned a linguistic alignment between the source sequence and the corresponding target sequence.

HeadLens. In the earlier example, we saw that most attention patterns between the English and the French words are diagonally shaped. One can analyze this pattern in detail by using our HeadLens. As shown in Figure 5, we chose head 7 in layer 4, which has such a diagonal attention pattern, and applied the HeadLens. After selecting a query and key cluster pair with a high similarity (Figure 5 (A)), we see that the query cluster has pronouns as its dominant POS tag type (brown-colored in Figure 5 (B)). Most of the query cluster words are subject words in French; for instance, ‘nous’ and ‘vous’ mean ‘we’ and ‘you’ in English, respectively. The representative words of the corresponding key cluster are mostly verbs. This result suggests that, when the input token is a subject, the model attends to verb words to predict verb tokens when translating from English to French.

6 Conclusions

In this paper, we present SANVis, a visual analytics system for self-attention networks, which supports in-depth understanding of multi-head self-attention networks at different levels of granularity. Using several usage scenarios, we demonstrate that our system provides the user with a deep understanding of the multi-head self-attention model in machine translation.

As future work, we plan to extend our HeadLens to perform clustering of value vectors. We also plan to evaluate our system with various researchers who use multi-head self-attention networks, and to apply our method to other state-of-the-art self-attention based models, such as BERT [jacob2018bert] and XLNet [Yang2019XLNetGA].

The authors wish to thank all reviewers who provided constructive feedback for our project. This work was partially supported by NCSOFT NLP Center. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. NRF-2018M3E3A1057305).