exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models

by   Benjamin Hoover, et al.
Harvard University

Large language models can produce powerful contextual representations that lead to improvements across many NLP tasks. Since these models are typically guided by a sequence of learned self attention mechanisms and may comprise undesired inductive biases, it is paramount to be able to explore what the attention has learned. While static analyses of these models lead to targeted insights, interactive tools are more dynamic and can help humans better gain an intuition for the model-internal reasoning process. We present exBERT, an interactive tool named after the popular BERT language model, that provides insights into the meaning of the contextual representations by matching a human-specified input to similar contexts in a large annotated dataset. By aggregating the annotations of the matching similar contexts, exBERT helps intuitively explain what each attention-head has learned.



There are no comments yet.


page 1

page 2

page 3

page 4


Visualizing Attention in Transformer-Based Language models

We present an open-source tool for visualizing multi-head self-attention...

VisBERT: Hidden-State Visualizations for Transformers

Explainability and interpretability are two important concepts, the abse...

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Breakthroughs in transformer-based models have revolutionized not only t...

Conditional Self-Attention for Query-based Summarization

Self-attention mechanisms have achieved great success on a variety of NL...

Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) and Targeted ASBA (TABSA) allow f...

Visualizing Attention in Transformer-Based Language Representation Models

We present an open-source tool for visualizing multi-head self-attention...

PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Neural networks based on the Transformer architecture have led to impressive improvements across many Natural Language Processing (NLP) tasks such as machine translation and text summarization [Vaswani et al., 2017]. The Transformer is based on subsequent application of “multi-head attention” to route the model reasoning, and this technique’s flexibility allows it to be pretrained on large corpora to generate contextual representations that can be used for other tasks.

Of these models, BERT is the most commonly used Transformer model for representation learning with several applications to transfer learning 

[Devlin et al., 2019]. BERT and it’s extensions [Sanh et al., 2019, Liu et al., 2019] are dominating the standard language understanding benchmarks [Wang et al., 2018, 2019]. Moreover, Transformer models have also had much success as autoregressively trained language models that can be used for generation tasks [Radford et al., 2019, Keskar et al., 2019].

It is not yet well-understood what information BERT encodes and how it uses the attention. To address this challenge, some research has focused on understanding whether BERT learns linguistic features such as Part Of Speech (POS), Dependency relationships (DEP), or Named Entity Recognition (NER) 

[e.g., Tenney et al., 2019b, Vig and Belinkov, 2019, Raganto and Tiedemann, 2018, Tenney et al., 2019a]. Clark et al. [2019] found that heads at different layers learn specific linguistic structure despite being trained in a completely unsupervised manner, although many heads ostensibly learn redundancies. Voita et al. [2019] further explore the nature of several specialized attention-heads to show that BERT depends on only a subset of the total heads and that overall model performance could be maintained when some heads were pruned.

The above analyses provide an in-depth but static glimpse into the behavior of transformers. Experiments for these analyses are often supported by open-source repositories that implement the newest architectures and thus enable rapid experimentation 

[Wolf et al., 2019]. We similarly need flexible evaluation frameworks for Transformer models that allow the community to test hypotheses rapidly. Toward that end, visualization tools can offer concise summaries of useful information and allow interaction with large models. Attention visualizations such as BertViz by Vig [2019] have taken large steps toward these goals by making exploration of BERT’s attention fast and interactive for the user. However, interpreting attention patterns without understanding the attended-to embeddings, or relying on attention alone for a faithful interpretation, can lead to faulty interpretations [Brunner et al., 2019, Jain and Wallace, 2019, Wiegreffe and Pinter, 2019].

To address this challenge, we developed exBERT, a tool that combines the advantages of a robust but static analysis with a dynamic and intuitive view into both the attention and internal representations of the underlying model.111exBERT is available at www.exbert.net. exBERT is agnostic to the underlying Transformer model and corpus and can thus be applied to different domains and languages. Similar to the static analysis by Clark et al. [2019], exBERT provides insights into both the attention and the token embeddings for the user-defined model and corpus by probing whether the representations capture metadata such as linguistic features or positional information.

Figure 1: An overview of the different components of the tool. The token “escape” is selected and masked at 0-[all]. The results from a corpus search by token embedding are shown and summarized in (d-g). Users can enter a sentence in (a) and modify the attention view through selections in (b). Self attention is displayed in (c). The blue matrices show the attention of a head (column) to a token (row). Tokens and heads that are selected in (c) can be searched over the annotated corpus (shown: Wizard of Oz) with results presented in (d). Every token in (d) displays its linguistic metadata on hover. A colored summary of the matched token (black highlight) and its context is shown in (e), which can be expanded or collapsed with the buttons above it. The histograms in (f) and (g) summarize the metadata of the results in (d) for the matched token and the token of max attention, respectively.

2 Background

2.1 Transformer Models

The Transformer model architecture as defined by Vaswani et al. [2017] relies on multiple sequential applications self attention layers. Self attention is the process by which each token within a sequence of inputs computes attention weights over all other tokens in the same input. Within the process, the inputs are projected into a key, query, and value representation , , and . The query and key representations are used to compute a weight for each token, which is then multiplied by that token’s value, such that the values for one attention head is defined as

Transformer models typically use of these self attention heads in parallel. Their outputs are concatenated and followed by a final linear projection. The output of this projection is used to calculate the token embedding used for the next layer.

2.2 Transformer Analysis

Previous analyses of transformer models focus on discovering how an unsupervised Transformer model learns to model our human understanding of language. For instance, Clark et al. [2019] showed that individual heads seem to recognize common POS and DEP relationships,e.g., Objects of the Preposition (POBJ), Determinants (DET), and Possessive Adjectives (POSS), with high fidelity. Vig and Belinkov [2019] also explored the dependency relations across heads and discovered that initial layers typically encode positional relations, middle layers capture the most dependency relations, and later layers look for unique patterns and structures. These insights are exposed visually and interactively through exBERT.

3 Overview

exBERT focuses on displaying a succinct view of both the attention and the internal representations of each token. The attention belonging to an input of length at a particular layer for a particular head can be understood as an matrix, where each row represents the attention out of the corresponding token in the input, and each column represents the attention into that token. This is conducive to a representation of curves pointing from each token to every other token. Representations, on the other hand, are best understood by comparing the embedding of a token to the embeddings of other tokens in an annotated corpus. The most similar token embeddings, defined by a nearest neighbor search, can be viewed in their corpus’ context in a language that humans can understand.

3.1 Components

Figure 1 shows an overview of the tool’s three main components. The Attention View provides an interactive view of the self attention of the model. Here, users can change layers, select heads, and view the aggregated attention. Tokens can be masked, and a selected token can be searched over the annotated corpus according to the methods laid out in Section 3.2. The results of this search are presented in the Corpus View, with the highest-similarity matches shown first. The Summary View shows histogram summaries of the matched metadata, which is useful for getting a snapshot of the metadata an embedding encodes in the searched corpus.

3.2 Searching

Inspired by Strobelt et al. [2017, 2018], exBERT

performs a nearest neighbor search of embeddings on a reference corpus that is processed with linguistic features as follows. First, the corpus is split by sentence, and its tokens are labeled for desired metadata (e.g., POS, DEP, NER). Searching by token embeddings performs a Cosine Similarity (CS) search with the tokens in the corpus 

[Johnson et al., 2019]. The top 50 matches are displayed and summarized for the user.

Searching by head embeddings also involves a CS search against the corpus but requires an extension of the self attention definition. In our case, we define the head embedding as

where is defined as the normalized representation of head at layer .

This normalization makes it possible to perform a CS search over the head embeddings in our preprocessed corpus. To search the corpus for only a select subset , we set all values of to in our query head embedding where .

4 Case Study: BERT

BERT is an instantiation of a Transformer model that can be used in applications that benefit from transfer learning [Devlin et al., 2019]. BERT introduces special tokens into the typical training process in addition to the input tokens that are extracted using Byte-Pair Encoding (BPE) [Sennrich et al., 2015]. The architecture requires every input to start with a “[CLS]” and end with a “[SEP]” token and uses a technique called Masked Language Modeling (MLM) to develop its language model [Devlin et al., 2019]

. MLM works by replacing random tokens in the training corpus with “[MASK]” and training the model to determine what word should belong. Since there is no information in the original vector embedding of “[MASK]”, BERT must rely on its internal representations to fill in the missing tokens. This provides an intuitive way to glimpse how BERT learns linguistic features in context.

In the following cases, the reference corpus used is the Wizard of Oz,222http://www.gutenberg.org/ebooks/55 which is annotated and processed by BERT to allow for nearest neighbor searching. Whenever BPE tokenization splits a single word into multiple tokens, we assign its metadata to each component token. Special tokens like “[CLS]” and “[SEP]” have no linguistic features assigned to them, and are therefore removed from the reference corpus, which allows searches to always match a token that has intuitive meaning for users. This also allows the tool to apply the same corpus to different transformer models that may require different tokenization.

We now explore the layers and heads at which BERT learns the linguistic features of a masked token. We look at the following sentence:

The girl ran to a local pub to escape the din of her city.

Unless otherwise noted, the following examples are from the BERT model, which has 12 layers and 12 heads per layer. We use the notation <layer>-<head> to refer to a single head at a single layer, and <layer>-[<heads>] to describe the cumulative attention of heads at a layer (e.g., 4-[0,3,9] to describe the sum of the attention of heads 0, 3, and 9 at layer 4). Note that we refer to layers and heads as 0 indexed in the tool.

4.1 Behind the mask

We begin by masking the “escape” token in the example sentence at layer 1 and search what information is behind the “[MASK]” token’s embedding. This setup is shown in Figure 1. Note that at this early layer, the matching embeddings are most similar to punctuation (PUNCT) and determinants (DET), which are the most common tokens in English (Figure 1f). Additionally, the maximum attention out of the MASKed token points to itself (Figure 1c). We can observe that there is no meaningful linguistic information encoded in the mask’s embedding at this layer.

As layers progress, more POS information is added to the token embedding. The summarized layer 5 search results (Figure 2a) show that we start to see some verb information creep into the embedding; however, it is not until layer 6 (Figure 2

b) that BERT is confident about the masked embedding being a verb. Subsequent layers seem to refine this estimation, with the embedding flirting with the possibility of the masked word being an adposition (ADP). At the final layer, BERT settles on the token being a VERB and wants to predict tokens such as “pass”, “mar”, “see”, “hear”, and “breathe”. For the complete progression, see Figure 

4 in Appendix A.

Figure 2: Left: searching by token embedding results. Histogram summaries shown at layers 5 (a) and 6 (b). Right: histogram summaries of searching by different head selections at layer 5.

4.2 Behind the heads

Searching by the masked token’s embedding helps show the information that is captured within the token itself, but it is useful to understand how the heads of the previous layer contributed to that information being encoded in the embedding. Going back to 5-[all], we see that the token embedding fails to embrace the masked “escape” token as a verb (Figure 2a). However, a search by head embedding at that point reveals that BERT has already learned to attend to sentence structures where the most similar tokens in the corpus are verbs (2c). Even at this early layer, it has learned to attend to the direct object (DOBJ) of that verb, a dependency that Clark et al. [2019] showed was strongest in head 8-9. Exploring other individual heads at this point for DEP relationships reveals that 5-3 primarily detects the ROOT dependency (2d) while 5-6 detects the AUX dependency (2e).

This is useful, but it is not clear how all the heads were able to maximize their attention on the “din” token and thus detect the DOBJ pattern that was in 18 of the top 50 matches in the search. Inspecting all the heads, it is clear that no head individually looks for DOBJ, and therefore that pattern must be detected through a combination of heads. Naively, we can strategically select the heads that maximize their attention to “din” (5-[0,4,5,7,8,9] shown in Figure 2f), but find that these most normally find the object of the preposition (POBJ). Further exploration shows that the DOBJ pattern can be detected by 5-[6,9] (2g), albeit with confusion to the DET dependency. It seems that complex dependencies like DOBJ can be detected in the early-middle layers of the model but rely on a combination of heads.

4.3 More than position

Figure 3: Exploration of positional heads, inspecting the positional head 2-0. The overview in (a) shows the head behavior always pointing to the following word, and the search token “of” is highlighted in (b). The matched results are summarized by POS and offset in (c) and by DEP in (d).

Figure 3a,c confirm that certain heads learn to attend to succeeding or preceding tokens. We call these heads positional heads in that they detect an offset from the current token [Clark et al., 2019]. Though simple, the positional head can encode important information about the attended-to word. Searching by head can reveal how much information from the token embeddings is visible to that head. A brief exploration of the attention in positional head 2-0 shows that the head is truly positional, matching the following word 50/50 times as seen in the lower histogram in Figure 3c. It also seems to match the POS belonging to the seed token (in this case, “of” is an ADP). The DEP summary at the bottom of Figure 3d additionally shows that not only does the head match the POS of the seed token, but it has also learned to look for cases where the word following a preposition is possessive (e.g., him/her/its). This kind of exploration shows how much information the different attention heads see from the tokens they attend to.

5 Conclusion

In this paper, we have introduced an interactive visualization, exBERT, that uses linguistic annotations, interactive masking, and nearest neighbor search to help revealing an intelligible structure about learned representations in transformer models. We demonstrate the applicability of exBERT to a specific case study for a BERT model across the Wizard of Oz corpus. The source code and demo are available at www.exbert.net, providing the community an opportunity to rapidly experiment with learned Transformer representations and gain a better understanding of what these models learn.

6 Acknowledgements

We thank Jesse Vig for his helpful feedback.


  • G. Brunner, Y. Liu, D. Pascual, O. Richter, and R. Wattenhofer (2019) On the validity of self-attention as explanation in transformer models. External Links: Link, 1908.04211 Cited by: §1.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does BERT look at? an analysis of bert’s attention. CoRR abs/1906.04341. External Links: Link, 1906.04341 Cited by: §1, §1, §2.2, §4.2, §4.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §4.
  • S. Jain and B. C. Wallace (2019) Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3543–3556. Cited by: §1.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §3.2.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1.
  • A. Raganto and J. Tiedemann (2018) An analysis of encoder representations in transformer-based machine translation. Note: In EMNLP Workshop: BlackboxNLP External Links: Link Cited by: §1.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. CoRR abs/1508.07909. External Links: Link, 1508.07909 Cited by: §4.
  • H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush (2018) Seq2Seq-vis: A visual debugging tool for sequence-to-sequence models. CoRR abs/1804.09299. External Links: Link, 1804.09299 Cited by: §3.2.
  • H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush (2017)

    LSTMVis: a tool for visual analysis of hidden state dynamics in recurrent neural networks

    IEEE transactions on visualization and computer graphics 24 (1), pp. 667–676. Cited by: §3.2.
  • I. Tenney, D. Das, and E. Pavlick (2019a) BERT rediscovers the classical NLP pipeline. CoRR abs/1905.05950. External Links: Link, 1905.05950 Cited by: §1.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick (2019b) What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.1.
  • J. Vig and Y. Belinkov (2019) Analyzing the structure of attention in a transformer language model. CoRR abs/1906.04284. External Links: Link, 1906.04284 Cited by: §1, §2.2.
  • J. Vig (2019) A multiscale visualization of attention in the transformer model. CoRR abs/1906.05714. External Links: Link, 1906.05714 Cited by: §1.
  • E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov (2019) Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. CoRR abs/1905.09418. External Links: Link, 1905.09418 Cited by: §1.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) Superglue: a stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537. Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Cited by: §1.
  • S. Wiegreffe and Y. Pinter (2019) Attention is not not explanation. External Links: Link, 1908.04626 Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) Transformers: state-of-the-art natural language processing. External Links: 1910.03771 Cited by: §1.

Appendix A Embedding results across layers

Figure 4: Token embeddings for the “escape” token setup in 4.1 across every layer. The matched tokens at the output of the model are shown in the bottom corpus inspector view, whereas summaries are shown for all the other layers