1 Related Work
We discuss related work from two perspectives: (1) visual analytic approaches for interpreting and interacting with various deep neural networks and (2) interpretation and analysis of self-attention networks mainly in NLP domains.
Regarding the former, various visual analytic approaches have been proposed for convolutional neural networks mainly computer vision domains[liu16cnnvis, zeiler2014visualizing, Pezzotti17DeepEyes, Liu18DeepTracker, alsall2017hierachy, revacnnfilm] and RNNs in NLP domains [ming2017rnnvis, strobelt2018lstmvis, strobelt_seq2seq-vis_2019, cashman_rnnbow_2018, kwon_retainvis_2019]. Visual analytic approaches have also been integrated with other advanced neural network architectures, such as generative adversarial networks [kahng18GanLab, wang2018GANViz]
, deep reinforcement learning[wang2019dqnvis]. Among them, Strobelt et al. [hendrik2018seq] developed a visual analytic system for RNN-based attention models, mainly for the exploration and understanding of sequence-to-sequence modeling tasks. However, despite the success of multi-head self-attention networks, such as BERT [jacob2018bert] and XLNet [Yang2019XLNetGA], visual analytic approaches for these advanced attention networks have not existed before.
In NLP domains, recent studies [jacob2018bert, Vig2019AnalyzingTS, Clark2019WhatDB] have analyzed diverse behaviors of different attention heads in a self-attention model and have drawn linguistic interpretations as to what kind of syntactic and/or semantic features each attention head captures. Another line of research [Voita2019AnalyzingMS, Strubell2018LinguisticallyInformedSF] have attempted to leverage insights obtained from such in-depth analysis to improve the prediction accuracy and computational efficiency by, say, removing unnecessary heads and refining them. However, these approaches have not properly utilized the potential of interactive visualization, so in this respect, our work is one of the first sophisticated visual analytics systems for self-attention networks.
2 Preliminaries: Self-Attention Networks
This section briefly reviews the self-attention module originally proposed in Transformer [vasw2017transformer]. Transformer adopts an encoder-decoder architecture to solve sequence-to-sequence learning tasks, e.g., neural machine translation, which converts a sentence in one language into that in another language. It converts a sequence of words in one domain into that in another domain. For example, for machine translation tasks, it translates a sentence in one language into that in another language. In this process, the encoder of Transformer converts input words (e.g., English words) to internal, hidden-state vectors, and the decoder turns the vectors into a sequence of output words (e.g., French words).
Each encoder and decoder respectively consists of multiple layers of computing functions inside. Furthermore, each layer in the encoder includes two sequential sub-layers, which are a multi-head self-attention and a position-wise feed-forward network. In addition to the multi-layer architecture of the encoder, the decoder has an additional attention layer, which is called as an encoder-decoder attention and helps the model to give attention to the encoders’ internal states. Each layer of both encoder and decoder also consists of skip-connection and layer normalization in their computation pipeline. Overall encoder and decoder architecture are the stacks of identical encoder layers or decoder layers, including an embedding layer.
We summarize the computation process with mathematical notations, so readers are advised to read the remaining section for details: Let us denote as the size of hidden state vector and as the number of heads in multi-head self-attention. Each dimension of query, key, and value vector is .
The embedding layer transforms the input token to its embedding space using a word embedding and adds the position information for each input token using sinusoidal functions (see Steps 1 and 2 in Figure 1), where is the -th input token in .
At each attention head, we transform encoded word vectors into three matrices of a query, a key, and a value, , , and , respectively, for times, which in turn generated
matrices, using the linear transformation and compute the attention-weighted combinations of value vectors as
where , and , and indicate the linear transformation matrices at the -th head. In multi-head self-attention, which consists of parallel attention heads, transformation matrices of each head are randomly initialized, and then each set is used to project input vectors onto a different representation subspace. For this reason, every attention head is allowed to have different attention shapes and patterns. This characteristic encourages each head differently to attend adjacent words or linguistics relation words.
In the encoder layer, source words (input words to the encoder) work as the input to the query, key, and value transformations at the -th head. In the decoder layer, the input can vary by attention types. While the decoders’ self-attention takes target words (output words of the decoder) as its input, the encoder-decoder attention has target words as input to a query transformation but source words as the input to a key and a value transformation.
3 Design Rationale
We consider our design rationale of SANVis as follows:
R1: Grasping the information flow across multiple layers.
R2: Identifying and making sense of attention patterns of each attention head.
R3: Visualizing syntactic and semantic information to allow of exploring the attention head in their query and key vectors.
We present SANVis,111Our demo is available at http://short.sanvis.org. a visual analytics system for deep understanding of self-attention models, as shown in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks. SANVis is composed of the network view and the HeadLens view. (1) The network view allows the user to understand the overall information flow through our visualization across the multiple layers (T1). Moreover, this view provides additional visualization options that assist the user in distinguishing distinct patterns from multiple attention patterns within a layer (T2). (2) The HeadLens view reveals the characteristics of the query and the key vectors and their relationship of a particular head (T3).
4.1 Network View
Network view mainly visualizes the overview of attention patterns across multiple layers using the Sankey diagram (T1). Additionally, this view supports ‘piling’ and ‘sorting’ capabilities to understand common as well as distinct attention patterns among multiple attention heads (T2). For example, one can replace the Sankey diagram with a multiple heatmap view, where multiple heatmaps corresponding to different heads can be sorted by several different criteria (Figure, SANVis: Visual Analytics for Understanding Self-Attention Networks (A-3)). The attention piling view aggregates multiple attention patterns into a small number of clusters (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(A-1)).
As shown in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(A), a set of words are sequentially aligned vertically in each layer, and represented the histogram according to attention weights from multiple heads. In Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(A-4), each bar corresponds to a particular head within the layer where its height represents the total amount of attention weights assigned to those words by a specific head. As with Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(A-2), if the fourth head in the layer attended to the word ‘planet’ more highly than others, the fourth bar would be higher than the others. In this manner, the user easily recognizes which heads highly attend those words based on the height of histogram bars. Furthermore, when the user moves the mouse over the particular color bar, we show an attention heat map of the corresponding head in that layer.
Sankey diagram. As shown in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(A-2), the edge weight between them represents the average attention weight across multiple heads within a particular layer. In this figure, we can see the strong link between ‘planet’ in layer 2 to the preposition words(‘on’), the same word and that article(‘the’) in layer 3. It means a significant amount of information of ‘planet’ in layer 2 is conveyed to encode each word of a phrase (“on the planet”) in layer 3. This pattern shows the model captures the context meaning of the word, which is defined as the linguistic phrase in the given sentence, for improving the quality of translation.
Attention Sorting. Figure 2 shows various attention patterns between query (y-axis) and key (x-axis) words for the different attention head in different layers, where a gray diagonal line indicates the position of attending itself for detecting attention patterns.
We focus on reducing the users’ efforts to find the distinguish attention patterns by using sorting algorithms, which is based on relative positional information and the entropy value in the attention (Figure 2). Relative positional information, such as whether the attention goes mainly toward the left, right, or the current location, as well as the column-wise mean entropy value of the attention matrix, allow the users to detect these patterns easily.
Figure 3 shows the sorted results of attention patterns based on our position or entropy sorting algorithms. When sorted by position, a number of attention is unambiguous that attention that inclines towards the past words are placed near the control panel at the top while those that lean towards the future words are placed relatively close to the bottom. When sorted by entropy, the uppermost attention has the lowest entropy and exhibits bar-shaped attention, which numerous query words attend the same word. At the bottom, the user can find that no more words focused on the same word.
Attention Piling. Inspired by the heatmap piling methods [bach2015smallmultiples, Henry2007nodetrix], we applied this piling idea to summarize multiple attention patterns in a single layer, as shown in the encoder part of Figure 4. To this end, we compute the feature vector of each attention head and perform clustering to the form of piles (or clusters) of multiple attention patterns.
The feature vector of a particular attention on the attention head is defined as a flattened -dimensional vector of its attention matrix, where is calculated from on the -th head, concatenated with additional three-dimensional vector of (1) the sum of the upper triangular part of the matrix, (2) that of the lower-triangular part, and (3) the sum of diagonal entries. This three-dimensional vector indicates the proportions how much attention is assigned to (1) the previous words of a query word, (2) its next words, (3) and the query itself, respectively.
Using these feature vectors of multiple attention heads within a single layer, we perform hierarchical clustering based on their Euclidean distances. In this manner, multiple attention patterns are grouped, forming an aggregated heatmap visualization per computed pile along with head indices belonging to each pile, as shown in Figure4. It helps the user to easily find the similar patterns and distinct patterns in the same layer by adjusting Euclidean distance.
To analyze a particular attention head, SANVis offers a novel view called the HeadLens, as shown in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B). This view facilitates detailed analysis of the query and key representations of the selected attention head, such as which linguistic or positional feature they encoded (T3). This view opens when a user clicks a particular heatmap corresponding to an attention head in the network view.
The HeadLens is generated as follows. (1) It performs clustering on query and key vectors separately. (2) For each pair of a query and a key cluster centroids, its pairwise similarity is computed, forming a heatmap visualization in Figur SANVis: Visual Analytics for Understanding Self-Attention Networks (B-2). (3) Additionally, the POS tagging and the positional information is summarized for each of the query and the key clusters (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-3)). (4) Once a user clicks a particular cell in a heatmap, its corresponding query and a key cluster are summarized in terms of their representative keywords (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-4)).
To be specific, in the first step, we consider all the sentences in a validation set and obtain the query and the key vectors of all the words from these sentences. These query and key vectors are the results of applying a query and a key transformation of input words for a given attention head. Next, we perform the -means++ [K_plus_mean] algorithm for each of the above-described query and key vector sets, by using the pre-defined number of clusters, e.g., 16 in our case. We empirically set this number of clusters by using an elbow method.
In the second step, we obtain the cluster centroid vectors from the set of clusters for query vectors as well as those centroid vectors for key vectors. In the third step, we compute all the pairwise inner product similarities between each pair of a query cluster centroid and a key cluster one, which are visualized as a heatmap (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks (B-2)). We choose the inner product as a similarity measure since the attention weight is mainly computed based on the inner product between a query and a key vector. Within each heatmap, the color of cells is marked red(or blue) if it has high(or low) similarity. High inner product between a query set and a key set means that the words in the query set are likely to attend to the words in the key set.
In the third step, the HeadLens provides a summary of each of the query and the key clusters. Each query (or key) cluster contains those words whose query (or key) vectors belong to the cluster. For those words, we obtain their part-of-speech (POS) tags and position indices within the sentence which each of them appears in. For POS tags, we used universal POS tagger [stafordCorenlp]. Afterward, the relative amount of those words with each POS tag type out of the entire words within a single cluster is shown as a horizontal bar width with its encoded color, as shown in the left bar of Figure SANVis: Visual Analytics for Understanding Self-Attention Networks (B-3). In addition, the relative amount of those words shown in a particular position of their original sentences are color-coded (a higher value colored as a red), as shown in the right bar of Figure SANVis: Visual Analytics for Understanding Self-Attention Networks (B-3).
Finally, in the fourth step, the user can click a particular entry in the cluster-level heatmap, e.g., a pair of a query and a key cluster with high similarity (a red cell highlighted in a black square in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-2)). Then, the summary of the corresponding query and key cluster are indicated by a black-colored edge (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-3)). Additionally, the word cloud visualization of such user-selected query and key cluster are used to highlight the frequently appearing words in each cluster, color-coded with their own POS tag types (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-4)).
For example, the selected entry in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-2) indicates that the query cluster 15 has high similarity with the key cluster 15. The selected query cluster mainly contains auxiliary verb words (orange-colored), while the selected key cluster mainly contains noun words (purple-colored) in Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-3). Furthermore, their most appearing words are shown in the word cloud view (Figure SANVis: Visual Analytics for Understanding Self-Attention Networks(B-3)), which means that this head assigns a high attention weight to these noun words, e.g., ‘world’ and ‘life’ when a query word is given as an auxiliary verb word, e.g., ‘is’ and ‘are.’ This result shows the selected head have captured the linguistic relationship of noun and auxiliary verb.
5 Usage Scenarios
This section demonstrates usage scenarios of SANVis, mainly focusing on the recently proposed Transformer. This model has shown superior performances in machine translation tasks, including English-French and English-German translation tasks in the WMT challenger [noauthor_acl_nodate]. Our implementation of the Transformer is based on the annotated Transformer [opennmt]. Our model parameter setting followed the base model in the original paper [vasw2017transformer]. We set our target task as English-French translation, where the collection of the scripts from TED lectures is used as our dataset [paul2018personalnmt]. The BLEU score of our model is shown as 38.4, which validates a reasonable level of performance. For evaluating our system, we used the validation set, which is not seen during training.
Attention Piling. In Figure 4, the encoder-decoder part shows the attention piling visualization in encoder-decoder attention. In this example, one can observe that a number of attention heads have a diagonal attention pattern. An appropriate explanation of this diagonal shape would be that the words in French and English are generally aligned in the same order [bahdnamu2014seq2attn]. For the debugging purpose, it proves that the model properly learned a linguistic alignment between the source sequence and the corresponding target sequence.
HeadLens. In the earlier example, we saw that most attention patterns between the English and the French words are diagonally-shaped between English and French words. One can analyze this pattern in detail by using our HeadLens. As shown in Figure 5, we chose head 4 in layer 7, which has such a diagonal attention pattern, and applied the HeadLens. Once selecting the query and the key cluster pair with a high similarity (Figure 5 (A)), it is shown that the query clusters commonly have an pronoun as a dominant POS tag type (brown-colored in Figure 5 (B)). Most of query cluster words are subject words in French, for instance, ‘nous’ and ‘vous’ mean ‘we’ and ‘you’ in English, respectively. The corresponding key clusters’ representative words are mostly verbs. This result demonstrates that the model attends verb words to predict verb tokens for translating from English to French, when the input token is subject.
In this paper, we present SANVis, a visual analytics system for self-attention networks, which supports in-depth understanding of multi-head self-attention networks at different levels of granularity. Using several usage scenarios, we demonstrate that our system provides the user with a deep understanding of the multi-head self-attention model in machine translation.
As future work, we plan to extend our HeadLens to perform clustering of value vectors. We evaluate our system by various researchers who use the multi-head self-attention networks. We also apply our method in other state-of-the-art self-attention based models, such as BERT [jacob2018bert] and XLNet [Yang2019XLNetGA].